"If a problem has no solution, it may not be a problem, but a fact -- not to be solved, but to be coped with over time."
High availability has always been a paramount concern in safety- and mission-critical applications, and recent experience with large Internet sites has underscored the need for availability in that domain as well. Traditional approaches to the problem have made three implicit assumptions: (1) failure rates of hardware and software are low and improving, (2) systems can be modeled for reliability analysis and their failure modes can be predicted, and (3) human error during maintenance is not a major source of failures. The result is an emphasis on failure avoidance as the path to high availability.
We claim that these assumptions are in many cases based on incorrect perceptions of today's environment, and that renewed emphasis should be given to failure recovery. Even the most highly tested systems occasionally exhibit "Heisenbugs" and suffer from transient or permanent hardware failure and software aging. Human error has empirically been found to account for a nontrivial fraction of catastrophic failures, such as the unavailability of Microsoft's site for several hours one day last year. The most successful systems, such as the Mars Pathfinder spacecraft, have been those that can recover from these unexpected errors because they were designed for recovery.
We introduce a new philosophy, recovery-oriented computing (ROC). We will investigate new system models and high-availability techniques that (a) allow structuring of systems that can trade improved availability against performance or quality of service, (b) quantify these tradeoffs and allow availability to be more precisely measured, (c) explicitly account for human operators and provide mechanisms for recovering from human mistakes, and (d) clearly identify the application domains in which the models can be effectively applied. We will identify the role of ROC vis-a-vis existing research agendas in systems engineering, software engineering, and fault-tolerant system design.
Course requirements will include:
Registration is by permission of the instructors only, and will be limited to about 20 students. Before the first week of class, email one of the instructors (see below) with your background, which CS courses you've taken in the graduate program, your research interests, and why you want to take the course. Please read the FAQ regarding background, requirements, topics, etc. Berkeley students should email Prof. Dave Patterson for permission to enroll in CS294-4, 3 units. Stanford students should email Prof. Armando Fox for permission to enroll in CS 444A, 4 units.
Instructors
With help from:
Administrative support:
Modified: 12 May 2002