Overview and Announcements updated 12 May 2002


"If a problem has no solution, it may not be a problem, but a fact -- not to be solved, but to be coped with over time."
Shimon Peres

High availability has always been a paramount concern in safety- and mission-critical applications, and recent experience with large Internet sites has underscored the need for availability in that domain as well.  Traditional approaches to the problem have made three implicit assumptions: (1) failure rates of hardware and software are low and improving, (2) systems can be modeled for reliability analysis and their failure modes can be predicted, (3) human error during maintenance is not a major source of failures.  The result is an emphasis on failure avoidance as the path to high availability.  We claim that these assumptions are in many cases based on incorrect perceptions of today's environment, and that renewed emphasis should be given to failure recovery.  Even the most highly tested systems occasionally exhibit "Heisenbugs" and suffer from transient or permanent hardware failure and software aging, and human error has empirically been found to account for a nontrivial fraction of catastrophic failures, such as the unavailability of Microsoft's site for several hours one day last year.  The most successful systems have been those that can recover from these unexpected errors because they were designed for recovery, such as the Mars Pathfinder spacecraft.

We introduce a new philosophy, recovery-oriented computing (ROC).  We will investigate new system models and high-availability techniques that (a) allow structuring of systems that can trade improved availability against performance or quality-of-service, (b) quantify these tradeoffs and allow availability to be more precisely measured, (c) explicitly account for human operators and provide mechanisms for recovering from human mistakes, and (d) clearly identify the application domains in which the models can be effectively applied.  We will identify the role of ROC vis-a-vis existing research agendas in systems engineering, software engineering, and fault-tolerant system design.

Course requirements will include:

  • Substantial weekly readings from the literature and discussion with authors/presenters/other students.
  • Research-quality team project, to be presented as a short talk during the last week of class and at a publicly attended poster session at Stanford.  If you are doing research in this general area, your research can likely be leveraged into a course project.
  • Project writeup of sufficient quality to be submitted as an article to a refereed conference.  Again, if you are doing research or taking a similar course that has this requirement, you can double up on the written report.

Registration is by permission of the instructors only, and will be limited to about 20 students.  Before the first week of class, email one of the instructors (see below) with your background, the graduate CS courses you have taken, your research interests, and why you want to take the course.

Please read the FAQ regarding background, requirements, topics, etc.

Berkeley students should email Prof. Dave Patterson for permission to enroll in CS294-4, 3 units.

Stanford students should email Prof. Armando Fox for permission to enroll in CS 444A, 4 units.

Venue and Logistics
  • Physical location and time:
    Soda Hall, UC Berkeley
    100 Gates Bldg, Stanford
    Every Thursday from 9/6/2001 to 12/6/2001
    Poster session at Stanford Friday 12/7/2001, followed by TGIF.
    Time: class from 3-4 PM; guest speaker 4-5 PM; dinner/discussion (food provided) until 6 PM.
  • Free Bus Transportation via the Magic Bus.
  • If you don't use the Magic Bus: We cannot get parking passes, so you'll have to use visitor pay parking or take public transportation.  Here are alternatives for getting to Berkeley and alternatives for getting to Stanford (including driving directions).
Staff and Contact Information


With help from:

Administrative support:


Modified: 12 May 2002