The Berkeley/Stanford Recovery-Oriented Computing (ROC) Project

The Recovery-Oriented Computing (ROC) project is a joint Berkeley/Stanford research project investigating novel techniques for building highly dependable Internet services.

Read an overview of our research into Recovery-Oriented Computing.

ROC News

New RAD Lab project and website!
Summer 2005 RADS Retreat
Winter 2005 ROC Retreat
Please participate in our BOINC projects collecting Windows crash data and resource availability.
Laying the foundation for the next-generation of distributed computing. EE Times, November 2004
CS 444A/294-4 Reliable Adaptive Distributed Systems (RADS) Class, Fall 2004
Summer 2004 ROC Retreat
Release of initial implementation of Undo for System Administrators and Operators (v0.1) (12 June 2003)
Self-Repairing Computers. Scientific American, June 2003.
CHI 2003 workshop "System Administrators are Users, Too: Designing Workspaces for Managing Internet-scale Systems", April 7th, 2003.
Autonomic Computing. Scientific American, May 2002.

People

Faculty

Dave Patterson (Berkeley)
Armando Fox (Stanford)

Berkeley Graduate Students

Aaron Brown
Pete Broadwell
Mike Chen
Archana Ganapathi
David Oppenheimer
Divya Ramachandran
Peter Bodík
Wei Xu

Stanford Graduate Students

George Candea
Andrew Huang
Emre Kiciman
Ben Ling

Other Alumni

Group Photographs

Santa Cruz (April 2002)

Courses

Fall 2004

Berkeley CS294-4/Stanford CS444A "Reliable Adaptive Distributed Systems".

Fall 2001

Berkeley CS294-4/Stanford CS444A "Recovery-Oriented Computing Seminar".

Publications

General ROC

Patterson, D. A., A. Brown, P. Broadwell, G. Candea, M. Chen, J. Cutler, P. Enriquez, A. Fox, E. Kiciman, M. Merzbacher, D. Oppenheimer, N. Sastry, W. Tetzlaff, J. Traupman, N. Treuhaft. Recovery-Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002.
Brown, A. and D. A. Patterson. Embracing Failure: A Case for Recovery-Oriented Computing (ROC). 2001 High Performance Transaction Processing Symposium, Asilomar, CA, October 2001.
Brown, A. and D. A. Patterson. To Err is Human. Proceedings of the First Workshop on Evaluating and Architecting System dependabilitY (EASY '01), Göteborg, Sweden, July 2001.
Brown, A. Accepting Failure: Availability through Repair-Centric System Design. UC Berkeley Qualifying Examination Proposal, Berkeley, CA, April 2001.
G. Candea, A. Brown, A. Fox, D. Patterson, Recovery-Oriented Computing: Building Multitier Dependability, IEEE Computer, November 2004.
G. Candea, E. Kiciman, S. Kawamoto, A. Fox, Autonomous Recovery in Componentized Internet Applications, Cluster Computing Journal, Vol. 9, No. 1, February 2006.
G. Candea, E. Kiciman, S. Zhang, P. Keyani, A. Fox, JAGR: An Autonomous Self-Recovering Application Server, Proc. 5th International Workshop on Active Middleware Services, Seattle, WA, June 2003.
G. Candea and A. Fox, Crash-Only Software, Proc. 9th Workshop on Hot Topics in Operating Systems (HotOS-IX), Lihue, Hawaii, May 2003.

ROC Techniques

Brown, A. A Recovery-Oriented Approach to Dependable Services: Repairing Past Errors With System-Wide Undo, UC Berkeley Computer Science Division Technical Report UCB//CSD-04-1304, December 2003. [abstract]
Brown, A. and D. A. Patterson. Undo for Operators: Building an Undoable E-mail Store. In Proceedings of the 2003 USENIX Annual Technical Conference, San Antonio, TX, June 2003 (Best Paper Award).
Brown, A. and D. A. Patterson. Rewind, Repair, Replay: Three R's to Dependability. 10th ACM SIGOPS European Workshop, Saint-Emilion, France, September 2002.
George Candea and Armando Fox. A Utility-Centered Approach to Building Dependable Infrastructure Services. Proc. 10th ACM SIGOPS European Workshop (EW-2002), Saint-Émilion, France, September 2002.
Broadwell, P., N. Sastry and J. Traupman. FIG: A Prototype Tool for Online Verification of Recovery Mechanisms. Workshop on Self-Healing, Adaptive and self-MANaged Systems (SHAMAN), New York, NY, June 2002.
George Candea, James Cutler, Armando Fox, Rushabh Doshi, Priyank Garg, Rakesh Gowda. Reducing Recovery Time in a Small Recursively Restartable System. Proc. International Conference on Dependable Systems and Networks (DSN-2002), Washington, D.C., June 2002.
George Candea and Armando Fox. Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel. Proc. 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII), Schloss Elmau, Germany, May 2001.
G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, A. Fox, Microreboot - A Technique for Cheap Recovery, Proc. 6th Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, December 2004.
G. Candea, J. Cutler, A. Fox, Improving Availability with Recursive Microreboots: A Soft-State System Case Study, Performance Evaluation Journal, Vol. 56, Nos. 1-3, March 2004.
G. Candea, M. Delgado, M. Chen, A. Fox, Automatic Failure-Path Inference: A Generic Introspection Technique for Internet Applications, Proc. 3rd IEEE Workshop on Internet Applications (WIAPP), San Jose, CA, June 2003.

Diagnosis

Oppenheimer, D. The importance of understanding distributed system configuration. System Administrators are Users, Too: Designing Workspaces for Managing Internet-Scale Systems (CHI 2003 (Conference on Human Factors in Computing Systems) workshop), April 2003.
Chen, M., E. Kiciman, E. Fratkin, E. Brewer and A. Fox. Pinpoint: Problem Determination in Large, Dynamic, Internet Services. Proceedings of the International Conference on Dependable Systems and Networks (IPDS Track), Washington D.C., 2002. [abstract]
George Candea and Armando Fox. Designing for High Availability and Measurability. Proc. 1st Workshop on Evaluating and Architecting System Dependability (EASY), Göteborg, Sweden, July 2001.
Brown, A., G. Kar, and A. Keller. An Active Approach to Characterizing Dynamic Dependencies for Problem Determination in a Distributed Environment. Proceedings of the Seventh IFIP/IEEE International Symposium on Integrated Network Management (IM 2001), Seattle, WA, May 2001.
P. Bodik, G. Friedman, L. Biewald, HT Levine, G. Candea, K. Patel, G. Tolle, J. Hui, A. Fox, M. I. Jordan, D. Patterson, Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization, Proc. 2nd International Conference on Autonomic Computing (ICAC), Seattle, WA, June 2005.

Benchmarking and System Measurement

Brown, A., L. Chung, W. Kakes, C. Ling, and D.A. Patterson. Experience with Evaluating Human-Assisted Recovery Processes. Proceedings of the 2004 International Conference on Dependable Systems and Networks. Florence, Italy, June 2004. [materials]
Broadwell, P. Response Time as a Performability Metric for Online Services. UC Berkeley Computer Science Technical Report UCB//CSD-04-1324, May 2004.
Oppenheimer, D., Archana Ganapathi, and David A. Patterson. Why do Internet services fail, and what can be done about it? 4th USENIX Symposium on Internet Technologies and Systems (USITS '03), March 2003. [talk slides]
Oppenheimer, D., Aaron B. Brown, Jonathan Traupman, Pete Broadwell, and David A. Patterson. Practical issues in dependability benchmarking. Second Workshop on Evaluating and Architecting System Dependability (EASY), October 2002.
Oppenheimer, D. and D. A. Patterson. Studying and using failure data from large-scale Internet services. 10th ACM SIGOPS European Workshop, Saint-Emilion, France, September 2002.
Merzbacher, M. and Dan Patterson. Measuring End-User Availability on the Web: Practical Experience. International Performance and Dependability Symposium, Washington DC, June 2002.
Oppenheimer, D. Why do Internet services fail, and what can be done about it? UC Berkeley Computer Science Division Technical Report UCB//CSD-02-1185, May 2002.
Patterson, D. A. A simple way to estimate the cost of downtime. Submission to 16th Systems Administration Conference (LISA '02), 2002.
Brown, A., L. C. Chung, D. A. Patterson. Including the Human Factor in Dependability Benchmarks. 2002 DSN Workshop on Dependability Benchmarking, Washington, D.C., June 2002.
Oppenheimer, D. and D. A. Patterson. Architecture, operation, and dependability of large-scale Internet services: three case studies. IEEE Internet Computing special issue on Global Deployment of Data Centers, September/October 2002.
Brown, A. Towards Availability and Maintainability Benchmarks: A Case Study of Software RAID Systems. UC Berkeley Masters Report, also available as UC Berkeley Computer Science Division Technical Report UCB//CSD-01-1132, Berkeley, CA, January 2001.
Brown, A. Availability Benchmarking of a Database System. Unpublished report, soon to be a Technical Report, Berkeley, CA, December 2000.
Brown, A. and D.A. Patterson. Towards Availability Benchmarks: A Case Study of Software RAID Systems. Proceedings of the 2000 USENIX Annual Technical Conference, San Diego, CA, June 2000.

ROC Hardware

Oppenheimer, D., A. Brown, J. Beck, D. Hettena, J. Kuroda, N. Treuhaft, D.A. Patterson, and K. Yelick. ROC-1: Hardware Support for Recovery-Oriented Computing. IEEE Transactions on Computers, vol. 51, no. 2, February 2002.

Talks

General ROC

A Simple Way to Estimate the Cost of Downtime. David Patterson. USENIX 16th System Administrators Conference (LISA '02). Presented November 7, 2002, Philadelphia, PA. [ppt]
Recovery Oriented Computing. David Patterson. Presented at Princeton University, University of Illinois, and University of Michigan, October 2002. [ppt]
Recovery Oriented Computing: A New Research Agenda for a New Century. David Patterson. 8th International Symposium on High-Performance Computer Architecture (HPCA 8) Keynote address, Presented February 6, 2002, Boston, MA. [abstract] [ppt] [MADtv clip] [MadTV clip script]
Availability and Maintainability >> Performance: New Focus for a New Century. David Patterson. USENIX Conference on File and Storage Technologies (FAST '02) Keynote address, Presented January 29, 2002, Monterey, CA. [Abstract] [ppt]
Recovery-Oriented Computing. Keynote Address by David Patterson at High Performance Transaction Systems Workshop (HPTS), October 2001. [ppt]
CS 294-4 First lecture. David Patterson. September 6, 2001 [ppt]
Recovery-Oriented Computing. David Patterson. HP Labs, June 6, 2001. [abstract] [ppt]
Embracing Failure: Availability through Recovery-Oriented Computing (ROC). Aaron Brown. Stanford CS548 Guest Lecture, May 2, 2001. [abstract] [ppt]
Embracing Failure: Availability through Repair-Centric Design. Aaron Brown. UC Berkeley Qualifying Examination Presentation, April 13, 2001. [ppt]
Reboot-Based High Availability. George Candea. Work-in-progress talk and poster, Symposium on Operating Systems Design and Implementation (OSDI), San Diego, CA, October 2000. [pdf](Abstract) [pdf](Poster)
Measuring End-User Availability and the Web: Practical Experience. Matthew Merzbacher. International Performance and Dependability Symposium, Washington DC, June 24, 2002.

Undo and Human Error

Rewind, Repair, Replay: Three R's to Dependability. Aaron Brown. SIGOPS European Workshop, St. Emilion, France, September 2002.
Rewind, Repair, Replay: Three R's to cope with human error. Talk given at IBM Almaden, March 2002. [ppt]
Bringing Undo to System Administration: A New Paradigm for Recovery. Work-in-progress talk, 15th Annual Systems Administration Conference (LISA 2001), December 2001. [ppt]
To Err is Human. First EASY Workshop, Göteborg, Sweden, July 1, 2001. [ppt]
Addressing Human Error with Undo. Summer 2001 ISTORE Retreat, Granlibakken, CA, June 2001. [ppt]

ROC Techniques

Diagnosis

An Active Approach to Characterizing Dynamic Dependencies for Problem Determination. IM 2001 Conference, May 16, 2001. [ppt]

Benchmarking

Availability and Maintainability Benchmarks: A Case Study of Software RAID Systems. UC Berkeley CS294-8 Guest Lecture, November 7, 2000. [ppt]

Retreat Talks and Posters

Hardware

Projects

SWORD: Scalable Wide-Area Resource Discovery
FIG: Library-Level Error Injection for Shared Libraries in UNIX/Linux. [download tar.gz]
Undo for System Administrators and Operators (source code available)

The ROC project is funded by NSF grant no. CCR-0085899, the NASA CICT (Computing, Information & Communication Technologies) Program, an NSF CAREER award, Allocity, Hewlett Packard, IBM, Microsoft, NEC, and Sun Microsystems.

Contact: roc-group at cs.berkeley.edu.
Last updated: 09/17/2008 11:28
