Debugging in the (Very) Large
Kinshuman Kinshumann, Kirk Glerum, Steve Greenberg, Gabriel Aul, Vince Orgovan, Greg Nichols, David Grant, Gretchen Loihle, and Galen Hunt: Debugging in the (Very) Large: Ten Years of Implementation and Experience. Communications of the ACM, 54(7), July 2011.
Windows Error Reporting (WER) is a distributed system that automates the processing of error reports coming from an installed base of a billion machines. WER has collected billions of error reports in 10 years of operation. It collects error data automatically and classifies errors into buckets, which are used to prioritize developer effort and report fixes to users. WER uses a progressive approach to data collection, which minimizes overhead for most reports yet allows developers to collect detailed information when needed. WER takes advantage of its scale to use error statistics as a tool in debugging; this allows developers to isolate bugs that cannot be found at smaller scale. WER has been designed for efficient operation at large scale: one pair of database servers records all the errors that occur on all Windows computers worldwide.
Engineering is in part what we do when quantitative differences become qualitative differences: when the ten pounds that we can lift easily becomes twenty, then two hundred. This paper is a very readable overview of how Microsoft's error reporting team has handled such a change in scale. WER's carefully tuned bucketing and aggregation help developers pinpoint errors and prioritize their work, so that the problems that affect the most people get the most attention. The authors' discussion of how this works, and of the insights that such big data can provide, is a great example of how innovative practices can open up new areas of research.
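To make the idea of bucketing concrete, here is a minimal sketch of grouping error reports into buckets and ranking the buckets by how many reports they hold. It is an illustration only, not WER's actual algorithm: the `ErrorReport` fields, the `bucket_key` function, and the choice of key are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class ErrorReport(NamedTuple):
    """Hypothetical, heavily simplified error report (fields are assumptions)."""
    program: str
    version: str
    module: str
    offset: int


BucketKey = Tuple[str, str, str, int]


def bucket_key(report: ErrorReport) -> BucketKey:
    # Assumption: reports that fail at the same place in the same build
    # share a root cause, so they go into the same bucket.
    return (report.program, report.version, report.module, report.offset)


def rank_buckets(reports: Iterable[ErrorReport]) -> List[Tuple[BucketKey, int]]:
    # Count reports per bucket and list the largest buckets first, so the
    # errors hitting the most reports are looked at first.
    counts = Counter(bucket_key(r) for r in reports)
    return counts.most_common()


reports = [
    ErrorReport("editor.exe", "1.0", "render.dll", 0x1A2B),
    ErrorReport("editor.exe", "1.0", "render.dll", 0x1A2B),
    ErrorReport("editor.exe", "1.0", "render.dll", 0x1A2B),
    ErrorReport("editor.exe", "1.0", "io.dll", 0x0040),
]

for key, count in rank_buckets(reports):
    print(count, key)
```

Even in this toy form, the ranking shows why aggregation matters at scale: a single count per bucket is enough to tell developers where a fix would help the most users.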