- R.P. Goldberg, Survey of Virtual Machine Research. IEEE Computer
Magazine 7(6): 34-45 (1974)
- T.C. Bressoud and F.B. Schneider, Hypervisor-based Fault Tolerance. ACM Transactions on Computer Systems
This week, we're doing something different. Instead of a standard
reading assignment, we'd like you to do some thinking to prepare for an
interactive discussion. The details can be found in Assignment
1, due 9/27.
Optional readings (these provide more technical detail and
historical perspective on the required Spainhower reading above):
- Richard P. Gabriel, The Rise of
"Worse is Better". This is section 2.1 of Lisp: Good News, Bad News, How to Win Big,
AI Expert 6(6):33-35, June 1991. Optional: the whole
article.
Also, for your enjoyment, here is a little story on how these ideas evolved.
- Jerome H. Saltzer, David P. Reed, and David D. Clark. End-to-End Arguments
in System Design. ACM Transactions on Computer Systems 2(4),
Nov. 1984, pp. 277-288.
- Mars Pathfinder priority
inversion explanation, and lessons learned (from Risks Digest)
Optional readings for more depth:
- Alan Demers, Karin Petersen, Mike Spreitzer, Douglas Terry, Marvin Theimer, Brent Welch.
Bayou Architecture: Support for Data Sharing Among Mobile Users. Proc. SOSP-15, Copper
Mountain, CO, 1995. Bayou trades consistency for availability by providing a set
of well-defined user models with varying guarantees, so applications can choose the model
appropriate for their semantics. This is the "canonical"
paper on trading consistency for availability.
- David E. Lowell, Subhachandra Chandra, and Peter M. Chen.
Exploring Failure Transparency and the Limits of Generic Recovery.
In Proc. OSDI 2000, San Diego, CA, Oct. 2000. Theory
and implementation of transparent checkpointing and recovery for Unix
programs. The conclusion is that programs rarely fail in ways that
allow this approach to be applied effectively.
- George Candea and Armando Fox. Recursive Restartability:
Turning the Reboot Sledgehammer Into a Scalpel. In
Proc. HotOS-VIII, Elmau, Germany, May 2001. Proactive and
reactive restarts can be used to improve overall system
availability, if the system is structured to allow components to be
- Eric A. Brewer. Lessons
from Giant-Scale Services. IEEE Internet Computing,
July/August 2001. Link above is an earlier draft; if you have
access to an IEEE online account, it's recommended that you retrieve
the camera-ready version. Some aspects of availability and
failure tolerance depend on physical constraints and boundaries as
much as on software engineering.
- Miguel Castro and Barbara Liskov, Practical
Byzantine Fault Tolerance. By using
cryptographically-signed messages, Byzantine agreement is fast enough
to implement a Byzantine-fault-resistant NFS server.
- Haifeng Yu and Amin Vahdat, Design
and Evaluation of a Continuous Consistency Model for Replicated
Services.
In Proc. OSDI 2000. TACT framework for quantifying
availability/consistency tradeoffs using app-specific units called
- Safeware: System Safety and
Computers, by Nancy G. Leveson. An overview of
the field of computer system safety; a bit chatty, but it has some good background from the
established safety-critical-systems community. Various appendices detail case
studies of systems that failed, including the Ariane 5 and Therac-25 stories.
- Fault Tolerance in Distributed
Systems, by Pankaj Jalote. Good reference
book for basic concepts and algorithms of "classical" fault tolerance:
distributed consensus, Byzantine agreement, replication and consistency, stable storage, atomic transactions,
reliable broadcast, etc.
- Human Error, by James Reason.
- Normal Accidents, by Charles Perrow.
Other Useful Resources