Stanford CS444A / Berkeley CS294-4
Recovery-Oriented Computing
Fall 2001




Bus Information

Staff/Contact Info







Other Stanford/Berkeley Collaboration

Reading List


November 8

November 1

October 11

  • R.P. Goldberg, Survey of Virtual Machine Research. IEEE Computer Magazine 7(6): 34-45 (1974)
  • T.C. Bressoud and F.B. Schneider, Hypervisor-based fault tolerance. ACM Transactions on Computer Systems 14(1):80-107 (1996)

October 4

No readings.

September 27

This week, we're doing something different. Instead of a standard reading assignment, we'd like you to do some thinking to prepare for an interactive discussion. The details can be found in Assignment 1, due 9/27.

September 20

Required readings:

Optional readings (these provide more technical detail and historical perspective on the required Spainhower reading above):

September 13

Optional readings for more depth:

  • Alan Demers, Karin Petersen, Mike Spreitzer, Douglas Terry, Marvin Theimer, Brent Welch. The Bayou Architecture: Support for Data Sharing Among Mobile Users. Proc. SOSP-15, Copper Mountain, CO, 1995.  Bayou trades consistency for availability by providing a set of well-defined user models with varying guarantees, so applications can choose the model appropriate for their semantics.  This is the "canonical" paper on trading consistency for availability.
  • David E. Lowell, Subhachandra Chandra, and Peter M. Chen.  Exploring Failure Transparency and the Limits of Generic Recovery.  In Proceedings OSDI 2000, San Diego, CA, Oct. 2000.  Theory and implementation of transparent checkpointing and recovery for Unix programs.  Conclusion is that programs rarely fail in ways that allow this approach to be applied effectively.
  • George Candea and Armando Fox.  Recursive Restartability: Turning the Reboot Sledgehammer Into a Scalpel.  In Proc. HotOS-VIII, Elmau, Germany, May 2001.  Proactive and reactive restarts can be used to improve overall system availability, if the system is structured to allow components to be restarted independently.
  • Eric A. Brewer.  Lessons from Giant-Scale Services.  IEEE Internet Computing, July/August 2001.  Link above is an earlier draft; if you have access to an IEEE online account, it's recommended that you retrieve the camera-ready version.  Some aspects of availability and failure tolerance depend on physical constraints and boundaries as much as on software engineering.
  • Miguel Castro and Barbara Liskov, Practical Byzantine Fault Tolerance.  By using cryptographically-signed messages, Byzantine agreement is fast enough to implement a Byzantine-fault-resistant NFS server.
  • Haifeng Yu and Amin Vahdat, Design and Evaluation of a Continuous Consistency Model for Replicated Services.  In Proc. OSDI 2000.  TACT framework for quantifying availability/consistency tradeoffs using app-specific units called conits.

Recommended Books

  • Safeware: System Safety and Computers, by Nancy G. Leveson.  Overview of the field of computer system safety; a bit chatty, but it has some good background from the established safety-critical-systems community.  The appendices detail case studies of systems that failed, including the Ariane 5 and Therac-25 stories.
  • Fault Tolerance in Distributed Systems, by Pankaj Jalote.  Good reference book for basic concepts and algorithms of "classical" fault tolerance: distributed consensus, Byzantine agreement, replication and consistency, stable storage, atomic transactions, reliable broadcast, etc.
  • Human Error, by James Reason.
  • Normal Accidents, by Charles Perrow.

Systems Overview

Other Useful Resources


Modified: 12 May 2002
