Other lectures

Lecture 1: Overview


The "systems track" is a survey of existing systems and what kind of techniques they employ to achieve reliability and availability.  Collection of systems "folklore" for robust systems: decoupling components using announce/listen; soft state vs. hard state; orthogonal mechanisms and sandboxes; language support; introspection; the transaction concept.

This class looks at how to build systems that function continuously within their specifications, even in the face of faults and failures (the specification can allow for downtime, harvest degradation, etc.) Preferably, these systems are built from COTS. We are particularly interested in identifying basic building blocks that can be easily employed in building sophisticated, complex, highly reliable software systems.
 

Systems and infrastructures face four categories of threats:

Examples of large-scale failures:

SANDIA defines the term of surety = reliability in normal environments, safety in abnormal environments, secruity and use control under hostile circumstances. Surety Science and Engineering is aimed at building and retrofitting large system infrastructures, with a particular emphasis on the human management of these systems. Our class will not look into such types of activity, although they are critical for the proper functioning of such large scale systems.

What characterizes some of these systems (and why do we care) ?

What kind of systems will we be interested in tomorrow?

Do we need such a level of robustness ? Will we need more/less reliability in the future?

Potential requirements for these systems:


Terminology

Real systems vs. Formal methods

fault tolerance, resistance, fault vs failure, ... dependability, reliability, availabilty, ... common mode (non-isolated, non-indept) failures.  Safety-critical vs mission-critical systems: requirements, design approaches. Is downtime allowed? How much? Does system have to crash gracefully? Does it have to restart itself? Concept vocabulary: failstop, failsafe, nonstop...

Systems approaches: formal (theory of fault tolerance), practical.  Current relevance: mission/safety critical systems; COTS-based systems.  Why it's hard to assure integrity of COTS based systems (vs. dedicated monolithic designs) but why this is a fact of life (monolithic designs are not only cost-ineffective but too complex for one person/team to implement if they can't use prior work as building blocks).

Why worry about safety?
 


In reality, computer systems usually consist of:

That is what makes computer systems so complex. To make systems correct, we can take two approaches:

  1. Build a correct computer system, that will never fail or that will correctly tolerate failures. Trying to prove a computer system correct is a daunting task, because of the huge complexity.
  2. Build a reasonably correct system and provide redundant mechanisms to increase the probability of its correctness. Is this graph true (confidence = freedom from bugs)?

    Once we pick a point on this curve, it is easier to move it up by using certain cheap (in terms of effort) mechanisms (e.g., orthogonal, such as the phone switch hack) because we combine several of these curves in their middlepoints and, by multiplication, move the point up the curve.
    Does this contradict Lampson? ("Making a system reliable is not really hard, if you know how to go about it. But retrofitting reliability to an existing design is very difficult"). No, because here we are not retrofitting to an existing design, rather designing a new system in which the old one is a subsystem.
    Such an approach would lend itself excellently to building reliable systems out of COTS components (we do it all the time. CRCs, for instance). Using the end-to-end approach is a good way to provide reliability on top of unreliable components.

In regard to availability, there are two approaches:

  1. system that never fails: no internal faults (provably perfect system) and no external faults (provably perfect conditions); very difficult both because of uncertainty and because of the unsurmountable problem of recursively proving that every layer of the system is correct
  2. system that contains sufficient redundancy so that it always has the minimal subset of components necessary to perform according to its specification (this includes intelligent systems that would have the ability to dynamically respecialize components)

Such redundancy comes in a variety of flavors: orthogonal mechanisms (Fox's favorite), replicate the system's state machine (Lamport's idea, Lampson's favorite) etc.

Current state of things: "Perfection is basically impossible. Instead, be prepared to practice risk management, to detect problems, to design fallback mechanisms and countermeasures, and to recover." (Ron Rivest, paraphrased). We want to try to systematize the ad-hoc approaches to FT.

Lamport: "A distributed system is a system in which I can't do my work because some computer has failed that I've never even heard of." --> let's do away with that.

General structure of lectures:

 
(the class could be sort of an expanded Chapter 4 of Hints)
 
 

Internet-based systems, Internet as a test bed, factor-of-10 rule

Assignment ideas:


Systems Landscape and Concept Vocabulary [make this a handout?]

Assignment ideas:


Lecture 2: Redundancy

- redundancy = "double-edged sword" (consistency, overall failure rate, need to mask failures transparently, larger system+maintenance cost)
- replication
- N-version programming + consensus
- functional redundancy
- hypervisors
- non-stop kernel, Tandem Himalaya, 2-of-everything approach
- hardware redundancy vs. software redundancy
 

  1.  
  2. RAID as a fault-tolerant system
  3. Guardian paper
  4. Jim Gray. A census of tandem system availability between 1985 and 1990. IEEE Trans. Reliability, 39(4):409-418, October 1990.
  5. Implementing fault-tolerant services using the state machine approach: a tutorial, by F. B. Schneider, ACM Computing Surveys, 22(4):299-319, Dec. 1990 [PDF] [HTML]
  6. Hypervisor-based fault tolerance, by Thomas C. Bressoud and Fred B. Schneider, ACM Transactions on Computer Systems, 14(1):80-107, Feb. 1996  [PS] [PDF]
  7. "Fault tolerance", by Jim Gray and Andreas Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1993, pages 93-156
  8. How to build a highly available system using consensus, by B. Lampson, in Distributed Algorithms, ed. Babaoglu and Marzullo, Lecture Notes in Computer Science 1151, Springer, 1996, pp 1-17 [PS] [PDF]

Assignment ideas:


Lecture 3: Fault Isolation/Containment. Imperviousness

- sandboxing
- firewalls
- SFI
- Mach
- virtual memory / address spaces

Concept of a sandbox and software fault isolation.  Examples: binary rewriting, Janus, firewall.  Using language support in sandboxing: type-safe languages (Spin and Modula-3), Exokernel (UDF's).  Why sandboxes are an orthogonal mechanism; orthogonal mechanisms in general.  Other examples of orthos: deadlock recovery in DB's; Berkeley TACC work.  Orthogonal mechanisms and simplicity: the joy of a small state space, esp. when composing n pieces, and how (whether?) this relates to formal techniques (being able to specify invariants because the state space is small enough to formally reason about).  Mars Pathfinder example (rebooting is an extreme orthogonal mechanism and not necessarily one that is always an option).

  1. "Efficient software-based fault isolation", Wahbe, Lucco, Anderson, Graham, in SOSP-14, pp. 203-216, Dec. 1993
  2. some Java ref. for sandboxing
  3. some firewall ref. for imperviousness
  4. a Mach ref. for general fault isolation
  5. Hardware fault containment in scalable shared-memory multiprocessors, by Dan Teodosiu, Joel Baxter, Kinshuk Govil, John Chapin, Mendel Rosenblum, and Mark Horowitz, Proc. 24th International Symposium on Computer Architecture, June 1997 [PS]
  6. Implementing Efficient Fault Containment for Multiprocessors, by Mendel Rosenblum, John Chapin, Dan Teodosiu, Scott Devine, Tirthankar Lahiri, and Anoop Gupta, Communications of the ACM 39(9):52-61, Sept. 1996 [PS]
  7. Designing a global name service, by B. Lampson, Proc. 4th ACM Symposium on Principles of Distributed Computing, Minaki, Ontario, 1986, pp 1-10. [PS] [PDF]

Example of the importance of fault isolation:

Assignment ideas:


Lecture 4: Language-based Mechanisms. End-to-End + Layers + Hierarchies

- meta-issue: achieving reliability on top of unreliable layers
- defensive programming
- language support (Spin, Vino, exo UDFs and DPFs)
- Janus, Slic ???
- Engler's latest stuff
- analysis of bug types, percentages, what caused them, etc., can argue for complexity of man-computer system (Yu paper)
- end-to-end approach (note Lampson's comments in Hints)
- relationship between end-to-end and formal methods vs. practical FT
 

  1. Jerome H. Saltzer, David P. Reed, and David. D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems 2, 4 (November, 1984) pages 277-288.
  2. Reliability hierarchies, by Peter M. Chen and David E. Lowell, Proc. 1999 Workshop on Hot Topics in Operating Systems (HotOS), March 1999 [PS]
  3. A software fault prevention approach in coding and root cause analysis, by Weider D. Yu, Bell Labs Technical Journal, Apr-Jun 1998 [PDF] [HTML]
  4. Reliable messages and connection establishment, by Butler W. Lampson, in Distributed Systems, ed. S. Mullender, 2nd ed., Addison-Wesley, 1993, pp 251-281 [PS] [PDF]
  5. Reliable communication in the presence of failures, by Kenneth P. Birman and Thomas A. Joseph, ACM Transactions of Computer Systems, 5(1):47-76, Feb. 1987 [PDF]
  6. The Design Philoshophy of the DARPA Internet Procotols David D. Clark, Proceedings of the 1988 SIGCOMM Symposium, pp 106-114, Stanford, CA, August 1988.
  7. The hardest multiplexing problem: disk (ch. 4) and Reflections on downloading code (ch. 6), in The exokernel operating system architecture, by Dawson R. Engler, Ph.D. thesis, Massachusetts Institute of Technology, October 1998 [PS]
  8. Dealing with disaster: surviving misbehaved kernel extensions, by Margo I. Seltzer, Yasuhiro Endo, Christopher Small, and Keith A. Smith, Proc. 2nd Symposium on Operating System Design and Implementation, Seattle, WA, Oct. 1996, pp. 213-228 [PS][HTML]
  9. Extensibility, safety, and performance in the SPIN operating system, by Brian Bershad, Stefan Savage, Przemyslaw Pardyak, Emin G. Sirer, Marc Fiuczynski, David Becker, Susan Eggers, and Craig Chambers, Proc. 15th Symposium on Operating System Principles, Copper Mountain, CO, Dec. 1995, pp. 267-284 [PS]

Assignment ideas:


Lecture 5: Orthogonality, Application Decomposition and Restructuring, Design for Failure


"The unavoidable price of reliability is simplicity" (Hoare) That's cool, because our approach using orthogonal mechanisms is simpler than building reliability into each component, because the orthogonal components are (largely) independent and so they contain/limit the extent of complexity.

Long-term maintainability, incremental upgradability (the Internet vs. the power grid); how is this done?  ("infrastructure" systems vs. "control" systems?)

- harvest/yield tradeoffs, Inktomi (probabilistic consistency/completeness)
- exponential backoff (Oracle anecdote)
- SNS/TACC:  retry/restart mechanisms
- Brooks' subsumption architecture
- transactions give us atomic, restartable actions (graceful crashes...) - databases, crash recovery vs. crash avoidance
- (touch also upon Lomet's ideas)

How to exploit application semantics (or user expectations) in designing a system to accommodate various kinds of "expected" failures.  Examples: MS Tiger fileserver, ATM machine, airline reservations system, Internet-scale servers (harvest/yield tradeoffs)

What really happened on Mars

  1.  
  2. Nancy G. Leveson and Clark S. Turner. An investigation of the Therac-25 accidents. Computer 26, 7 (July, 1993) pages 18-41.
  3. Seltzer, M., Small, C., Smith, M., Symbiotic Systems Software, Workshop on Compiler Support for Systems Software (WCSSS '96). [PS]
  4. Efficient Transparent Application Recovery in Client-Server Information Systems, by David B. Lomet and G. Weikum, Proc. ACM SIGMOD Conference, Seattle, WA (June 1998), pp. 460-471 [PS]
  5. Bayou: Replicated Database Services for World-Wide Applications, by Karin Petersen, Mike J. Spreitzer, Doug B. Terry, and Marvin M. Theimer, Proc. Seventh ACM SIGOPS European Workshop (EuroSIGOPS '96), Connemara, Ireland, Sept. 1996, pp. 275-280 [PS]
  6. The CIRRUS banking network, by David Gifford and Alfred Spector, Communications of the ACM 28(8): 798-807, Aug. 1985 [PDF] [commentary]  --> a good example of how out-of-band mechanisms are employed to robustify the ATM network

  7.  

Example of the need to design for failure (the safety mechanism itself failed in a byzantine way):

Do you believe current airplanes are built to protect passengers in the event of crashes? To what extent? They are quite flimsy. Car manufacturers have started worrying about such thing only in recent years. In terms of safety, the conventional aircraft is a disaster when it comes to fuel alone. Fuel is stored in the wings, directly above the sources of flame (the engines) and under the passengers in the center of the aircraft (this was the fuel tank which exploded in TWA Flight #800). Some of the newer aircraft even store fuel in the horizontal tail fins. In case of accident, passengers are surrounded by fuel - alongside and behind and below (when gravity and the directional momentum of the aircraft are taken into account, this leads to most passengers being encircled by burning fuel). On Burnelli aircraft, the fuel tanks are only in the wings, with no fuel under or behind the passengers. The landing-gear is retracted into the body which is the main structure of the aircraft. Furthermore, the engines are on the top rear-most portion of the aircraft, away from the fuel tanks. All of the most volatile components have been isolated.

Assignment ideas:


Lecture 6: Transactions, ACID vs. BASE, Soft state & announce/listen

Transaction = an exception handling mechanism (if something fails, it gives you the ability to return where you left from and start again). As an unfortunate consequence, recursive/nested transactions are problematic (some parts of the trans succeed, others fail, so you roll back everything, including the successful parts).

Transaction concept as a building block for FT systems; ACID semantics, why they're hard to get right and even harder to get good performance as well.  Why crash recovery under ACID is hard; write ahead logging, etc.  The price of the transactional abstraction.

Loose coupling with Soft state and announce/listen: design of wide-area Internet protocols, including route advertisement, multicast routing, RSVP, SRM.  Role of soft state and announce/listen in these systems. Extensions- using these mechanisms at the application level.

"What's cool about transactions is that the semantics are so simple" (Jim Gray, personal communication)

Transactions cannot be really viewed as an orthogonal FT mechanism, because they rely on a multitude of layers below them; the transaction is a the top of pyramid.

     

  1. The TWA Reservations System, by David Gifford and Alfred Spector, Communications of the ACM 27(7) --> good example of trading consistency for availability and performance


Assignment ideas:


Lecture 7: Introspective Sys, Self-Repairing Sys, Embryonic Sys, Evolvable Sys, Self-Reconfiguring Sys

The mythical telephone-switch-software daemons that run around and fix up corrupted data structures [need a ref for this!].  ISTORE and other self-monitoring systems.   If these techniques were more disciplined, woudl there be a place for this as a first-class design methodology in a software engineering curriculum?

  1. ISTORE: Introspective Storage for Data-Intensive Network Services, by Brown, A., D. Oppenheimer, K. Keeton, R. Thomas, J. Kubiatowicz, and D.A. Patterson, Proc. 7th Workshop on Hot Topics in Operating Systems (HotOS-VII), Rio Rico, Arizona, March 1999 [PS]
  2. Autonet: A high-speed, self-configuring local area network using point-to-point links, by Michael Schroeder, Andrew D. Birrell, Michael Burrows, Hal Murray, Roger M. Needham, Thomas L. Rodeheffer, Edwin H. Satterthwaite, Charles P. Thacker, IEEE Journal on Selected Areas in Communications, 9(8): 1318-1335, Oct. 1991 [PS] [PDF]
  3. Fault-tolerant Systems: The way Biology does it!, by Cesar Ortega and Andy Tyrrell, Proc. Euromicro 97 (Short Contributions), Budapest, IEEE CS Press, Sept. 1997, pp.146-151 [PDF]
  4. Computer Know Thy Self!: A Biological Way to Look at Fault-Tolerance, by Andy Tyrrell, Proc. 2nd Euromicro Workshop on Dependable Computing Systems, Sept. 1999
  5. Self-Organizing Collaborative Environments, by Balakrishnan, H., Seshan, S., Bhagwat, P., and Kaashoek, F. Position paper for the NSF/DARPA/NIST Workshop on Smart Environments, Atlanta, GA, July 1999. [TXT]
  6. Building an Artificial Brain Using an FPGA Based CAM-Brain Machine, by Hugo de Garis, Michael Korkin, Felix Gers, Eiji Nawa, Michael Hough, Applied Mathematics and Computation Journal, Special Issue on "Artificial Life and Robotics, Artificial Brain, Brain Computing and Brainware", North Holland, 1999 [PS] [PDF]
  7. Evolvable hardware: The genetic programming of Darwin machines, by Hugo de Garis, International Conference on Artificial Neural Nets and Genetic Algorithms, 1993, Innsbruck, Austria [PS] [PDF]
  8. Baring it all to Software: Raw Machines, by Elliot Waingold, Michael Taylor, Devabhaktuni Srikrishna, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter Finch, Rajeev Barua, Jonathan Babb, Saman Amarasinghe, and Anant Agarwal, IEEE Computer 30(9):86-93, Sept. 1997 [PS] [PDF]
  9. Raw Computation, by Anant Agarwal, Scientific American, August 1999 [HTML]

  10.  

Assignment ideas:


Lecture 8: Wishful Thinking / Guest Lecture. Wrapup

- Identify how to combine systems techniques w/formal techniques. General strategy: find out what properties, invariants, etc. are enjoyed by software constructed "the right way" and see whether the above mechanisms can be used to give these properties to software written "the wrong way".
- Project issue ???  ( ?)

Assignment ideas:


Unassigned refs:

  1. The Systematic Improvement of Fault Tolerance in the Rio File Cache, by Wee Teck Ng and Peter M. Chen, Proc. 1999 Symposium on Fault-Tolerant Computing (FTCS) , June 1999 [PS]
  2. Crash recovery in a distributed data storage system, by Butler W. Lampson, unpublished technical report, Xerox Palo Alto Research Center, June 1979 [PS] [PDF]
  3. Jim Gray and Daniel P. Sieworek. High-availability computer systems. Computer 24, 9 (September, 1991) pages 39-48.
  4. Henry Petroski. Engineering: History and failure. American Scientist 80, 6 (November-December, 1992) pages 523-526.
  5. "Vision for High-Availability of the Global Infrastructure", panel at the 28th Annual International Symposium on Fault-Tolerant Computing, 23 - 25 June, 1998, Munich, Germany (included Ram Chillarege [IBM], Robert Horst, Clifford Meltzer, Angelo Pruscino [Oracle] and Dalibor F. Vrsalovic)
  6. "TFT: A Software System for Application-Transparent Fault Tolerance", by Thomas C. Bressoud. In Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, Pp. 128-137, June 1998. [PDF]
  7. Alfred Spector and David Gifford. The space shuttle primary computer system. Communications of the ACM 27, 9 (September, 1984) pages 874-900.
  8. Why cryptosystems fail, by Ross J. Anderson, Communications of the ACM 37(11):32-40, Nov. 1994) [PS]  --> ATMs
  9. Ptolemy/DS1 probe (Barney Pell)
  10. A computer science perspective of bridge design, by Alfred Spector and David Gifford, Communications of the ACM, Vol. 29, No. 4 (April 1986), Pages 267-283 [PDF]  --> an interesting way to look at other engineering fields and try to apply the lessons to the computer systems field
  11. Case study: IBM's system/360-370 architecture, by David Gifford and Alfred Spector, Communications of the ACM, 30(4): 291-307, Apr. 1987 [PDF]

Homework assignments (currently unsorted and unmatched to topics)

  1. Discuss some ways to retrofix Y2K?
  2. Argue some convincing cases in which orthogonal mechanisms are of limited utility.

Programming/Lab assignments

  1. Robustification library. What could you put in it? How would you express the operations it can do in such a way that applications can pick their recovery policies? Should it be kernel code or a user level library? Would it be truly orthogonal (like a sandbox) or quasi-orthogonal (like a debug-malloc library)?
  2. Hazard analysis of a "large-component" system, e.g. a system in which the "components" are things like Apache, Oracle DB server, etc. Follow the same methodology as for "traditional" failure analysis. How would the analysis change when glass-box info from the developer of each component is added?
  3. Simulator that generates "directed random" failures, either as a source of stimulation or as a SimOS-type library that apps can be run on top of for testing.