CS 548: Internet and Distributed Systems Seminar

Reading List

Quick jump to readings due: 11/8 | 11/1 | 10/11 | 10/4 | 9/27 | 9/20 | 9/13

November 8

Jim Gray, What Next? A Dozen Information-Technology Research Goals, Microsoft Research Technical Report MS-TR-99-50, June 1999.

November 1

Brendan Murphy and Ted Gent, Measuring System and Software Reliability using an Automated Data Collection Process, Quality and Reliability Engineering International, Vol 11, pp. 341-353, 1995.
Jim Gray, A Census of Tandem System Availability Between 1985 and 1990, Tandem Technical Report 90.1, January 1990.

October 11

R.P. Goldberg, Survey of Virtual Machine Research. IEEE Computer Magazine 7(6): 34-45 (1974)
T.C. Bressoud and F.B. Schneider, Hypervisor-based fault tolerance. ACM Transactions on Computer Systems 14(1):80-107 (1996)

October 4

No readings.

September 27

This week, we're doing something different. Instead of a standard reading assignment, we'd like you to do some thinking to prepare for an interactive discussion. The details can be found in Assignment 1, due 9/27.

September 20

Required readings:

K. Appleby, S. Fakhouri, L. Fong, G. Goldszmidt, et al. Oceano - SLA Based Management of a Computing Utility. In Proc. 2001 IEEE/IFIP Int'l Symp. on Integrated Network Management (IM 2001), Seattle, WA, May 2001.
L. Spainhower and T. Gregg. IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective. IBM Journal of Research and Development 43(5-6):863-873, Sept.-Nov. 1999.

Optional readings (these provide more technical detail and historical perspective on the required Spainhower reading above):

L. Spainhower and T. Gregg. G4: A Fault-Tolerant CMOS Mainframe. In Proc. 28th International Symposium on Fault-Tolerant Computing, Munich, Germany, June 1998.
L. Spainhower, J. Isenberg, R. Chillarege, J. Berding. Design for fault-tolerance in System ES/9000 Model 900. In Proc. 22nd International Symposium on Fault-Tolerant Computing, Boston, MA, July 1992.

September 13

Richard P. Gabriel, The Rise of "Worse is Better". This is section 2.1 of Lisp: Good News, Bad News, How to Win Big, AI Expert 6(6):33-35, June 1991. Optional: the whole article [HTML] [PDF]. Also, for your enjoyment, here is a little story on how these ideas evolved.
Jerome H. Saltzer, David P. Reed, and David D. Clark. End-to-End Arguments in System Design. ACM Transactions on Computer Systems 2(4), Nov. 1984, pp. 277-288.
Mars Pathfinder priority inversion explanation, and lessons learned (from Risks Digest)

Optional readings for more depth:

Alan Demers, Karin Petersen, Mike Spreitzer, Douglas Terry, Marvin Theimer, Brent Welch. The Bayou Architecture: Support for Data Sharing Among Mobile Users. Proc. SOSP-15, Copper Mountain, CO, 1995. Bayou trades consistency for availability by providing a set of well-defined user models with varying guarantees, so applications can choose the model appropriate for their semantics. This is the "canonical" paper on trading consistency for availability.
David E. Lowell, Subhachandra Chandra, and Peter M. Chen. Exploring Failure Transparency and the Limits of Generic Recovery. In Proc. OSDI 2000, San Diego, CA, Oct. 2000. Theory and implementation of transparent checkpointing and recovery for Unix programs. The conclusion is that programs rarely fail in ways that allow this approach to be applied effectively.
George Candea and Armando Fox. Recursive Restartability: Turning the Reboot Sledgehammer Into a Scalpel. In Proc. HotOS-VIII, Elmau, Germany, May 2001. Proactive and reactive restarts can be used to improve overall system availability, if the system is structured to allow components to be restarted independently.
Eric A. Brewer. Lessons from Giant-Scale Services. IEEE Internet Computing, July/August 2001. Link above is an earlier draft; if you have access to an IEEE online account, it's recommended that you retrieve the camera-ready version. Some aspects of availability and failure tolerance depend on physical constraints and boundaries as much as on software engineering.
Miguel Castro and Barbara Liskov, Practical Byzantine Fault Tolerance. By using cryptographically-signed messages, Byzantine agreement is fast enough to implement a Byzantine-fault-resistant NFS server.
Haifeng Yu and Amin Vahdat, Design and Evaluation of a Continuous Consistency Model for Replicated Services. (also .ps) In Proc. OSDI 2000. TACT framework for quantifying availability/consistency tradeoffs using app-specific units called conits.

Recommended Books

Safeware: System Safety and Computers, by Nancy G. Leveson. An overview of the field of computer system safety; a bit chatty, but it has some good background from the established safety-critical-systems community. The appendices detail case studies of systems that failed, including the Ariane 5 and Therac-25 stories.
Fault Tolerance in Distributed Systems, by Pankaj Jalote. Good reference book for basic concepts and algorithms of "classical" fault tolerance: distributed consensus, Byzantine agreement, replication and consistency, stable storage, atomic transactions, reliable broadcast, etc.
Human Error, by James Reason
Normal Accidents, by Charles Perrow

Systems Overview

ACID semantics and transactions: lecture 1 of Joe Hellerstein's UC Berkeley CS 186 (Databases)

Other Useful Resources

Risks digest archives and subscription info
Phil Koopman's (CMU) pages on dependability, affordable dependability, and hardening COTS hardware and software

Modified: 12 May 2002