The Berkeley/Stanford Recovery-Oriented Computing (ROC) Project
The Recovery-Oriented Computing (ROC) project is a joint Berkeley/Stanford research
project investigating novel techniques for building highly dependable Internet
services.
Read an overview of our research into Recovery-Oriented Computing.
ROC News
People
Faculty
Berkeley Graduate Students
Stanford Graduate Students
Other Alumni
Group Photographs
Courses
Fall 2004
Fall 2001
Publications
General ROC
- Patterson, D. A., A. Brown, P. Broadwell, G. Candea, M. Chen, J. Cutler, P.
Enriquez, A. Fox, E. Kiciman, M. Merzbacher, D. Oppenheimer, N. Sastry, W. Tetzlaff,
J. Traupman, N. Treuhaft. Recovery-Oriented
Computing (ROC): Motivation, Definition, Techniques, and Case Studies.
UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15,
2002.
- Brown, A. and D. A. Patterson. Embracing
Failure: A Case for Recovery-Oriented Computing (ROC). 2001 High Performance
Transaction Processing Symposium, Asilomar, CA, October 2001.
- Brown, A. and D. A. Patterson. To Err is Human.
Proceedings of the First Workshop on Evaluating and Architecting System
dependabilitY (EASY '01), Göteborg, Sweden, July 2001.
- Brown, A. Accepting Failure: Availability
through Repair-Centric System Design. UC Berkeley Qualifying Examination
Proposal, Berkeley, CA, April 2001.
- G. Candea, A. Brown, A. Fox, D. Patterson,
Recovery-Oriented Computing: Building
Multitier Dependability, IEEE Computer, November 2004.
- G. Candea, E. Kiciman, S. Kawamoto, A. Fox,
Autonomous Recovery in Componentized
Internet Applications, Cluster Computing Journal, Vol. 9, No. 1, February
2006.
- G. Candea, E. Kiciman, S. Zhang, P. Keyani, A. Fox,
JAGR: An Autonomous Self-Recovering Application
Server, Proc. 5th International Workshop on Active Middleware Services, Seattle,
WA, June 2003.
- G. Candea and A. Fox, Crash-Only
Software, Proc. 9th Workshop on Hot Topics in Operating Systems (HotOS-IX),
Lihue, Hawaii, May 2003.
ROC Techniques
- Brown, A. A Recovery-Oriented
Approach to Dependable Services: Repairing Past Errors With System-Wide Undo,
UC Berkeley Computer Science Division Technical Report UCB//CSD-04-1304, December
2003. [abstract]
- Brown, A. and D. A. Patterson.
Undo for Operators: Building an
Undoable E-mail Store. In Proceedings of the 2003 USENIX Annual Technical
Conference, San Antonio, TX, June 2003 (Best Paper Award).
- Brown, A. and D. A. Patterson. Rewind,
Repair, Replay: Three R's to Dependability. 10th ACM SIGOPS European Workshop,
Saint-Emilion, France, September 2002.
- George Candea and Armando Fox.
A Utility-Centered Approach to Building Dependable
Infrastructure Services. Proc. 10th ACM SIGOPS European
Workshop (EW-2002), Saint-Émilion, France, September 2002.
- Broadwell, P., N. Sastry and J. Traupman. FIG: A
Prototype Tool for Online Verification of Recovery Mechanisms. Workshop
on Self-Healing, Adaptive and self-MANaged Systems (SHAMAN), New York, NY,
June 2002.
- George Candea, James Cutler, Armando Fox, Rushabh Doshi, Priyank Garg, Rakesh
Gowda. Reducing Recovery Time in a Small
Recursively Restartable System. Proc. International Conference on Dependable
Systems and Networks (DSN-2002), Washington, D.C., June 2002.
- George Candea and Armando Fox.
Recursive Restartability: Turning
the Reboot Sledgehammer into a Scalpel. Proc. 8th Workshop on Hot Topics
in Operating Systems (HotOS-VIII), Schloss Elmau, Germany, May 2001.
- G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, A. Fox, Microreboot - A Technique for Cheap Recovery, Proc. 6th Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, December 2004.
- G. Candea, J. Cutler, A. Fox, Improving Availability with Recursive Microreboots: A Soft-State System Case Study, Performance Evaluation Journal, Vol. 56, Nos. 1-3, March 2004.
- G. Candea, M. Delgado, M. Chen, A. Fox, Automatic Failure-Path Inference: A Generic Introspection Technique for Internet Applications, Proc. 3rd IEEE Workshop on Internet Applications
(WIAPP), San Jose, CA, June 2003.
Diagnosis
- Oppenheimer, D. The importance of understanding
distributed system configuration. System Administrators are Users, Too:
Designing Workspaces for Managing Internet-Scale Systems (CHI 2003 (Conference
on Human Factors in Computing Systems) workshop), April 2003.
- Chen, M., E. Kiciman, E. Fratkin, E. Brewer and A. Fox.
Pinpoint: Problem Determination in Large,
Dynamic, Internet Services. Proceedings of the International Conference
on Dependable Systems and Networks (IPDS Track), Washington D.C., 2002. [abstract]
- George Candea and Armando Fox. Designing
for High Availability and Measurability. Proc. 1st Workshop on Evaluating
and Architecting System Dependability (EASY), Göteborg, Sweden, July 2001.
- Brown, A., G. Kar, and A. Keller. An Active Approach
to Characterizing Dynamic Dependencies for Problem Determination in a Distributed
Environment. Proceedings of the Seventh IFIP/IEEE International Symposium
on Integrated Network Management (IM 2001), Seattle, WA, May 2001.
- P. Bodik, G. Friedman, L. Biewald, HT Levine, G. Candea, K. Patel, G. Tolle, J. Hui, A. Fox, M. I. Jordan, D. Patterson, Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization, Proc. 2nd International Conference on Autonomic Computing (ICAC), Seattle, WA, June 2005.
Benchmarking and System Measurement
- Brown, A., L. Chung, W. Kakes, C. Ling, and D.A. Patterson.
Experience with Evaluating Human-Assisted
Recovery Processes. Proceedings of the 2004 International Conference on
Dependable Systems and Networks. Florence, Italy, June 2004. [materials]
- Broadwell, P. Response Time as a Performability
Metric for Online Services. UC Berkeley Computer Science Technical Report
UCB//CSD-04-1324, May 2004.
- Oppenheimer, D., Archana Ganapathi, and David A. Patterson.
Why do Internet services fail, and what can be done
about it? 4th USENIX Symposium on Internet Technologies and Systems (USITS
'03), March 2003. [talk slides]
- Oppenheimer, D., Aaron B. Brown, Jonathan Traupman, Pete Broadwell, and David
A. Patterson. Practical issues in dependability benchmarking.
Second Workshop on Evaluating and Architecting System Dependability (EASY),
October 2002.
- Oppenheimer, D. and D. A. Patterson.
Studying and using failure data from large-scale Internet services. 10th
ACM SIGOPS European Workshop, Saint-Emilion, France, September 2002.
- Merzbacher, M. and Dan Patterson.
Measuring End-User
Availability on the Web: Practical Experience. International Performance
and Dependability Symposium, Washington DC, June 2002.
- Oppenheimer, D. Why do Internet services
fail, and what can be done about it? UC Berkeley Computer Science Division
Technical Report UCB//CSD-02-1185, May 2002.
- Patterson, D. A. A simple way to estimate
the cost of downtime. Submission to 16th Systems Administration Conference
(LISA '02), 2002.
- Brown, A., L. C. Chung, D. A. Patterson.
Including the Human Factor in Dependability
Benchmarks. 2002 DSN Workshop on Dependability Benchmarking, Washington,
D.C., June 2002.
- Oppenheimer, D. and D. A. Patterson. Architecture,
operation, and dependability of large-scale Internet services: three case studies.
IEEE Internet Computing special issue on Global Deployment of Data Centers,
September/October 2002.
- Brown, A. Towards Availability and Maintainability
Benchmarks: A Case Study of Software RAID Systems. UC Berkeley Masters
Report, also available as UC Berkeley Computer Science Division Technical
Report UCB//CSD-01-1132, Berkeley, CA, January 2001.
- Brown, A. Availability Benchmarking of a Database System. Unpublished report,
soon to be a Technical Report, Berkeley, CA, December 2000.
- Brown, A. and D.A. Patterson. Towards Availability
Benchmarks: A Case Study of Software RAID Systems. Proceedings of the 2000
USENIX Annual Technical Conference, San Diego, CA, June 2000.
ROC Hardware
- Oppenheimer, D., A. Brown, J. Beck, D. Hettena, J. Kuroda, N. Treuhaft, D.A.
Patterson, and K. Yelick. ROC-1: Hardware Support for
Recovery-Oriented Computing. IEEE Transactions on Computers, vol. 51,
no. 2, February 2002.
Talks
General ROC
- A Simple Way to Estimate the Cost of Downtime. David Patterson.
USENIX 16th System Administrators
Conference (LISA '02). Presented November 7, 2002, Philadelphia, PA. [ppt]
- Recovery Oriented Computing. David Patterson. Presented at Princeton University,
University of Illinois, and University of Michigan, October 2002. [ppt]
- Recovery Oriented Computing: A New Research Agenda for a New Century. David
Patterson. 8th
International Symposium on High-Performance Computer Architecture (HPCA 8)
Keynote address, Presented February 6, 2002, Boston, MA. [abstract]
[ppt] [MADtv clip] [MADtv clip script]
- Availability and Maintainability >> Performance: New Focus for a New Century.
David Patterson. USENIX Conference
on File and Storage Technologies (FAST '02) Keynote address, Presented January
29, 2002, Monterey, CA. [Abstract] [ppt]
- Recovery-Oriented Computing. Keynote Address by David Patterson at High
Performance Transaction Systems Workshop (HPTS), October 2001. [ppt]
- CS 294-4 First lecture. David Patterson. September 6, 2001
[ppt]
- Recovery-Oriented Computing. David Patterson. HP Labs, June 6, 2001.
[abstract] [ppt]
- Embracing Failure: Availability through Recovery-Oriented Computing (ROC).
Aaron Brown. Stanford CS548 Guest Lecture, May 2, 2001. [abstract]
[ppt]
- Embracing Failure: Availability through Repair-Centric Design. Aaron Brown.
UC Berkeley Qualifying Examination Presentation, April 13, 2001. [ppt]
- Reboot-Based High Availability.
George Candea. Work-in-progress talk and poster, Symposium on Operating Systems
Design and Implementation (OSDI), San Diego, CA, October 2000. [pdf] (Abstract)
[pdf] (Poster)
- Measuring End-User Availability and the Web: Practical
Experience. Matthew Merzbacher. International Performance and Dependability
Symposium, Washington DC, June 24, 2002.
Undo and Human Error
- Rewind, Repair, Replay: Three R's to Dependability. Aaron Brown. SIGOPS
European Workshop, St. Emilion, France, September 2002.
- Rewind, Repair, Replay: Three R's to cope with human error. Talk given
at IBM Almaden, March 2002. [ppt]
- Bringing Undo to System Administration: A New Paradigm for Recovery. Work-in-progress
talk, 15th Annual Systems Administration Conference (LISA 2001), December
2001. [ppt]
- To Err is Human. First EASY Workshop, Göteborg, Sweden, July 1, 2001.
[ppt]
- Addressing Human Error with Undo. Summer 2001 ISTORE Retreat, Granlibakken,
CA, June 2001. [ppt]
ROC Techniques
Diagnosis
Benchmarking
Retreat Talks and Posters
Hardware
Projects
The ROC project is funded by NSF grant no. CCR-0085899, the NASA CICT (Computing,
Information & Communication Technologies) Program, an NSF CAREER award, Allocity,
Hewlett Packard, IBM, Microsoft, NEC, and Sun Microsystems.
Contact: roc-group at cs.berkeley.edu.
Last updated: 09/17/2008 11:28