Problem Detection and Diagnosis in Large-Scale Distributed Systems

Recovery-Oriented Computing Research Group
University of California, Berkeley

We are attempting to bridge the gap between the knowledge possessed by the architects, implementers, and operators of large-scale Internet services and that of the academic community, particularly with respect to the causes of failures and techniques for detecting and diagnosing them.

If you would like to help us with this project by contributing quantitative or qualitative information about the causes and impact of failures in your service, and/or the techniques that you have found useful (and not useful!) for detecting and diagnosing failure causes, please send email.

This work is motivated by the paper "Why do Internet services fail, and what can be done about it?"

For more information, contact David Oppenheimer.