Other lectures
- "Who do you trust", what do you do to exploit that trust, how do
you detect or recover from uncommon case when trust fails.
Lecture 1: Overview
The "systems track" is a survey of existing systems and what kind of techniques
they employ to achieve reliability and availability. Collection of systems
"folklore" for robust systems: decoupling components using announce/listen; soft
state vs. hard state; orthogonal mechanisms and sandboxes; language support;
introspection; the transaction concept.
This class looks at how to build systems that function continuously within their
specifications, even in the face of faults and failures (the specification can allow for
downtime, harvest degradation, etc.) Preferably, these systems are built from COTS. We are
particularly interested in identifying basic building blocks that can be easily employed
in building sophisticated, complex, highly reliable software systems.
- "national critical infrastructures", as defined by the Presidential Commission
on Critical Infrastructure Protection are:
- information and communication
- transportation
- banking and finance
- emergency services
- government services
- electric power
- oil and gas
- water systems
- national defense infrastructure
- national power grid + nuclear power plants/reactors
- air traffic control system
- stock exchange
- space stations + shuttles
- medical equipment
- water filtration plants
- railroad crossing barriers, traffic light controls
- anti-lock brake systems in your car
- microwave ovens
- Internet backbones
- e-commerce servers
Systems and infrastructures face four categories of threats:
- physical (aging, degradation, natural disasters, acts of war and terrorism, etc.)
- software (bugs, malicious intrusion, cyberterorism/sabotage)
- human (operator error, disgruntled employees)
- system complexity (increasing # of interdependencies and automation --> cascading
effects)
Examples of large-scale failures:
- The Western power outage (Aug. 10, 1996): a transmission cable sagged onto a tree limb
causing a power failure, which cascaded across 13 states, parts of Mexico and Canada,
affecting 7.5 million customers. Telecom industry relied on backup generators and power
was luckily restored before generators ran out of fuel. Estimated financial losses for
California: $1-3 billion.
- Dec. 18, 1997: Accidental disruption of electric power --> outage in the
transportation infrastructure. Kansas City air traffic control system outage lasted 2
hours, 40 minutes, affecting hundreds of flights and tens of thousands of passengers.
- Feb. 20, 1998 - Mar. 27, 1998: Auckland power outage lasted 36 days. A heat-wave induced
power cable failure started a massive cascading failure through other infrastructures,
disrupting medical and emergency services, causing traffic snarls, taking out
telecommunications, and eventually causing businesses to abandon the Central Business
district of Auckland, which produced heavy economic losses in the region.
- May 19, 1998: "Galaxy 4" satellite malfunction caused comm. loss for nearly 90
percent of the 45 million pagers in the U.S., caused a disruption of service to medical
and law enforcement personnel as well as financial services to banks, retail stores, and
gas stations.
- Oct. 23, 1997: A disgruntled employee flipped 39 switches in the PG&E Mission Street
substation (San Francisco). The resulting power outage affected the transportation
infrastructure and thousands of morning commuters due to failed traffic signals and the
closing of two BART stations.
- Ariane 5 missile launch ???
SANDIA defines the term of surety = reliability in normal environments,
safety in abnormal environments, secruity and use control under hostile circumstances.
Surety Science and Engineering is aimed at building and retrofitting large system
infrastructures, with a particular emphasis on the human management of these systems. Our
class will not look into such types of activity, although they are critical for the proper
functioning of such large scale systems.
What characterizes some of these systems (and why do we care) ?
- mission criticality (e.g., defense), same as
- safety requirement (e.g., air traffic control) ?
- huge scale
What kind of systems will we be interested in tomorrow?
- improved versions of the ones above
- Internet-based services (massive scale, mission critical)
- automated transportation (SeaTac airport train), highways (GM efforts)
- international/corporate electronic espionage/sabotage prevention
- increased dependability on computing (pervasive computing, etc.)
Do we need such a level of robustness ? Will we need more/less reliability in the
future?
- increasing complexity, size, and importance of computer systems can lead to collapse
under own burden (NT 5.0 has 40 million lines of code; Solaris 7 has 12 million lines of
code, acording to Java World, July 1998)
- software systems still are major src of security/safety catastrophes. Yet, we rely every
day more and more on software systems in our everyday lives.
- high complexity means low reliability
Potential requirements for these systems:
- downtime
- graceful crashes + restarts
- self-restarting
Terminology
- what is a fault? what is a failure?
- safety-critical vs. mission-critical
- stopping failures, Byzantine failures
- failstop, failsafe, nonstop
Real systems vs. Formal methods
- evolution
- synergies
- antipathies
- common future ?
fault tolerance, resistance, fault vs failure, ... dependability, reliability,
availabilty, ... common mode (non-isolated, non-indept) failures. Safety-critical vs
mission-critical systems: requirements, design approaches. Is downtime allowed? How much?
Does system have to crash gracefully? Does it have to restart itself? Concept vocabulary:
failstop, failsafe, nonstop...
Systems approaches: formal (theory of fault tolerance), practical. Current
relevance: mission/safety critical systems; COTS-based systems. Why it's hard to
assure integrity of COTS based systems (vs. dedicated monolithic designs) but why this is
a fact of life (monolithic designs are not only cost-ineffective but too complex for one
person/team to implement if they can't use prior work as building blocks).
Why worry about safety?
- On the average, about 300 persons a year die in commercial air crashes in the United
States,while thousands of children are killed in accidents, most of them in and around the
home. And yet, we feel in our bones that flying is dangerous and it is. .... When airlines
say they have records of '100 million passenger-miles without an accident,' the
implication is that all those miles were equally hazardous, which is preposterous. Most
accidents happen in and around airports, so the distance between takeoff and landing does
not count for very much. (Prof. Edmund J. Cantilli, a transportation safety expert,
Newsday Magazine, 12/12/82) In other words, if you total distances traveled by airplanes
around airports, bus and intercity buses and trains are safer than flying commercial
aircraft and if you total annual deaths due to transportation, subways are safer than
flying. So much for the myth of flying-as-safest-mode-of-transportation!
- Talk about TWA 800, ValuJet 592, DC-8 cargo carrier accident in Kansas City, Amtrak
derailment in Arizona.
In reality, computer systems usually consist of:
- hardware
- software
- human operators
- policies
- etc.
That is what makes computer systems so complex. To make systems correct, we can take
two approaches:
- Build a correct computer system, that will never fail or that will correctly tolerate
failures. Trying to prove a computer system correct is a daunting task, because of the
huge complexity.
- Build a reasonably correct system and provide redundant mechanisms to increase the
probability of its correctness. Is this graph true (confidence = freedom from bugs)?
Once we pick a point on this curve, it is easier to move it up by
using certain cheap (in terms of effort) mechanisms (e.g., orthogonal, such as the phone
switch hack) because we combine several of these curves in their middlepoints and, by
multiplication, move the point up the curve.
Does this contradict Lampson? ("Making a system reliable is not
really hard, if you know how to go about it. But retrofitting reliability to an
existing design is very difficult"). No, because here we are not retrofitting to an
existing design, rather designing a new system in which the old one is a subsystem.
Such an approach would lend itself excellently to building reliable
systems out of COTS components (we do it all the time. CRCs, for instance). Using the
end-to-end approach is a good way to provide reliability on top of unreliable components.
In regard to availability, there are two approaches:
- system that never fails: no internal faults (provably perfect system) and no external
faults (provably perfect conditions); very difficult both because of uncertainty and
because of the unsurmountable problem of recursively proving that every layer of the
system is correct
- system that contains sufficient redundancy so that it always has the minimal subset of
components necessary to perform according to its specification (this includes intelligent
systems that would have the ability to dynamically respecialize components)
Such redundancy comes in a variety of flavors: orthogonal mechanisms (Fox's favorite),
replicate the system's state machine (Lamport's idea, Lampson's favorite) etc.
Current state of things: "Perfection is basically impossible. Instead, be prepared
to practice risk management, to detect problems, to design fallback mechanisms and
countermeasures, and to recover." (Ron Rivest, paraphrased). We want to try to
systematize the ad-hoc approaches to FT.
Lamport: "A distributed system is a system in which I can't do my work because
some computer has failed that I've never even heard of." --> let's do away with
that.
General structure of lectures:
- 25 minutes: concepts, "robustification" techniques
- 20 minutes: case study -- identify concepts in analyzed system
- 30 minutes: class discussion of 2nd case study -- identify concepts, critique the design
in view of FT
(the class could be sort of an expanded Chapter 4 of Hints)
Internet-based systems, Internet as a test bed, factor-of-10 rule
Assignment ideas:
- Survey the students' knowledge of the systems we are going to cover, in order to better
gauge the discussions (and perhaps select students for the class)
- What gave birth to the Y2K problem? How could it have been prevented? How can the
problem be retro-fixed now? How can the problem be "tolerater" after it happens?
Systems Landscape and Concept Vocabulary [make this a handout?]
- Redundancy -- a double edged sword (consistency, overall failure rate, need to mask
failures transparently, larger system cost and maintenance cost).
- Persistent vs. soft state.
- Announce/listen and the use of multicast, vs. explicit name bindings (i.e. why is
multicast OK for "small" announce/listen systems, but inappropriate as the main
mechanism for, say, DNS?)
- Loose vs. tight coupling and the relative costs of each (weaker consistency, worse
performance, but better failure semantics)
- The end-to-end argument and how it applies to error handling
- Harvest and yield
- Nonstop systems and process pairs
- Transactions; crash recovery vs. crash avoidance
- "Expected" vs. "unexpected" failure modes; defensive programming
- What are the expected applications for these systems? The Internet and phone companies
can tolerate, generally, a few percentage points of failure ("tududuu... all circuits
are busy now, please try your call later"); a bank, however, cannot (how about this:
"you have about $1500 in your account +/- a couple hundred"...)
Assignment ideas
:
Lecture 2: Redundancy
- redundancy = "double-edged sword" (consistency, overall failure rate, need
to mask failures transparently, larger system+maintenance cost)
- replication
- N-version programming + consensus
- functional redundancy
- hypervisors
- non-stop kernel, Tandem Himalaya, 2-of-everything approach
- hardware redundancy vs. software redundancy
-
- RAID as a fault-tolerant system
- Guardian paper
- Jim Gray. A census of tandem system availability between 1985 and 1990. IEEE Trans.
Reliability, 39(4):409-418, October 1990.
- Implementing fault-tolerant services using the state machine approach: a tutorial,
by F. B. Schneider, ACM Computing Surveys, 22(4):299-319, Dec. 1990 [PDF] [HTML]
- Hypervisor-based fault tolerance, by Thomas C. Bressoud and Fred B.
Schneider, ACM Transactions on Computer Systems, 14(1):80-107, Feb. 1996 [PS] [PDF]
- "Fault tolerance", by Jim Gray and Andreas Reuter, Transaction Processing:
Concepts and Techniques, Morgan Kaufmann, 1993, pages 93-156
- How to build a highly available system using consensus, by B. Lampson, in
Distributed Algorithms, ed. Babaoglu and Marzullo, Lecture Notes in Computer Science 1151,
Springer, 1996, pp 1-17 [PS] [PDF]
Assignment ideas
:
Lecture 3: Fault Isolation/Containment. Imperviousness
- sandboxing
- firewalls
- SFI
- Mach
- virtual memory / address spaces
Concept of a sandbox and software fault isolation. Examples: binary rewriting,
Janus, firewall. Using language support in sandboxing: type-safe languages (Spin and
Modula-3), Exokernel (UDF's). Why sandboxes are an orthogonal mechanism; orthogonal
mechanisms in general. Other examples of orthos: deadlock recovery in DB's; Berkeley
TACC work. Orthogonal mechanisms and simplicity: the joy of a small state space,
esp. when composing n pieces, and how (whether?) this relates to formal techniques
(being able to specify invariants because the state space is small enough to formally
reason about). Mars Pathfinder example (rebooting is an extreme orthogonal mechanism
and not necessarily one that is always an option).
- "Efficient software-based fault isolation", Wahbe, Lucco, Anderson, Graham, in
SOSP-14, pp. 203-216, Dec. 1993
- some Java ref. for sandboxing
- some firewall ref. for imperviousness
- a Mach ref. for general fault isolation
- Hardware fault containment in scalable shared-memory multiprocessors, by
Dan Teodosiu, Joel Baxter, Kinshuk Govil, John Chapin, Mendel Rosenblum, and Mark
Horowitz, Proc. 24th International Symposium on Computer Architecture, June 1997 [PS]
- Implementing Efficient Fault Containment for Multiprocessors, by Mendel
Rosenblum, John Chapin, Dan Teodosiu, Scott Devine, Tirthankar Lahiri, and Anoop Gupta,
Communications of the ACM 39(9):52-61, Sept. 1996 [PS]
- Designing a global name service, by B. Lampson, Proc. 4th ACM Symposium on
Principles of Distributed Computing, Minaki, Ontario, 1986, pp 1-10. [PS] [PDF]
Example of the importance of fault isolation:
- Nov. 12, 1975, JFK airport: During the takeoff roll, a DC-10-30 struck sea gulls and the
takeoff was rejected. The right engine disintegrated and caught fire; several tires and
wheels disintegrated; and the aircraft did not decelerate as expected. Near the end of the
runway, the captain steered the aircraft onto a taxiway; the landing gear collapsed, and
most of the aircraft was consumed by the fire. Of the 139 persons on the aircraft, 2 were
seriously inured and 30 were slightly injured. The NTSB determined that the probable cause
of the accident was the disintegration and subsequent fire in the No. 3 engine when it
ingested a large number of sea gulls. Following the disintegration of the engine, the
aircraft failed to decelerate effectively because: (l) The No. 3 hydraulic system was
inoperative, which caused the loss of the No. 2 brake system and braking torque to be
reduced 50 percent; (2) the No. 3 engine thrust reversers were inoperative; (3) at least
three tires disintegrated; (4) the No. 3 system spoiler panels on each wing could not
deploy; and (5) the runway surface was wet. The following factors contributed to the
accident: (l) The bird-control program at JFK Airport did not effectively control the bird
hazard on the airport; and (2) the FAA and the General Electric Company failed to consider
the effects of rotor imbalance on the abradable epoxy shroud material when the engine was
tested for certification.
Assignment ideas
:
- Contradict the usefullness of fault isolation in achieving reliability.
Lecture 4: Language-based Mechanisms. End-to-End + Layers + Hierarchies
- meta-issue: achieving reliability on top of unreliable layers
- defensive programming
- language support (Spin, Vino, exo UDFs and DPFs)
- Janus, Slic ???
- Engler's latest stuff
- analysis of bug types, percentages, what caused them, etc., can argue for complexity of
man-computer system (Yu paper)
- end-to-end approach (note Lampson's comments in Hints)
- relationship between end-to-end and formal methods vs. practical FT
- Jerome H. Saltzer, David P. Reed, and David. D. Clark. End-to-end arguments in system
design. ACM Transactions on Computer Systems 2, 4 (November, 1984) pages 277-288.
- Reliability hierarchies, by Peter M. Chen and David E. Lowell, Proc. 1999
Workshop on Hot Topics in Operating Systems (HotOS), March 1999 [PS]
- A software fault prevention approach in coding and root cause analysis, by
Weider D. Yu, Bell Labs Technical Journal, Apr-Jun 1998 [PDF]
[HTML]
- Reliable messages and connection establishment, by Butler W. Lampson, in
Distributed Systems, ed. S. Mullender, 2nd ed., Addison-Wesley, 1993, pp 251-281 [PS] [PDF]
- Reliable communication in the presence of failures, by Kenneth P. Birman
and Thomas A. Joseph, ACM Transactions of Computer Systems, 5(1):47-76, Feb. 1987 [PDF]
- The Design Philoshophy of the DARPA Internet Procotols David D. Clark, Proceedings of
the 1988 SIGCOMM Symposium, pp 106-114, Stanford, CA, August 1988.
- The hardest multiplexing problem: disk (ch. 4) and Reflections on
downloading code (ch. 6), in The exokernel operating system architecture, by
Dawson R. Engler, Ph.D. thesis, Massachusetts Institute of Technology, October 1998 [PS]
- Dealing with disaster: surviving misbehaved kernel extensions, by Margo I.
Seltzer, Yasuhiro Endo, Christopher Small, and Keith A. Smith, Proc. 2nd Symposium on
Operating System Design and Implementation, Seattle, WA, Oct. 1996, pp. 213-228 [PS][HTML]
- Extensibility, safety, and performance in the SPIN operating system, by
Brian Bershad, Stefan Savage, Przemyslaw Pardyak, Emin G. Sirer, Marc Fiuczynski, David
Becker, Susan Eggers, and Craig Chambers, Proc. 15th Symposium on Operating System
Principles, Copper Mountain, CO, Dec. 1995, pp. 267-284 [PS]
Assignment ideas
:
Lecture 5: Orthogonality, Application Decomposition and Restructuring, Design for
Failure
"The unavoidable price of reliability is simplicity" (Hoare) That's cool,
because our approach using orthogonal mechanisms is simpler than building reliability into
each component, because the orthogonal components are (largely) independent and so they
contain/limit the extent of complexity.
Long-term maintainability, incremental upgradability (the Internet vs. the power grid);
how is this done? ("infrastructure" systems vs. "control"
systems?)
- harvest/yield tradeoffs, Inktomi (probabilistic consistency/completeness)
- exponential backoff (Oracle anecdote)
- SNS/TACC: retry/restart mechanisms
- Brooks' subsumption architecture
- transactions give us atomic, restartable actions (graceful crashes...) - databases,
crash recovery vs. crash avoidance
- (touch also upon Lomet's ideas)
How to exploit application semantics (or user expectations) in designing a system to
accommodate various kinds of "expected" failures. Examples: MS Tiger
fileserver, ATM machine, airline reservations system, Internet-scale servers
(harvest/yield tradeoffs)
What really happened on Mars
-
- Nancy G. Leveson and Clark S. Turner. An investigation of the Therac-25 accidents.
Computer 26, 7 (July, 1993) pages 18-41.
- Seltzer, M., Small, C., Smith, M., Symbiotic Systems Software, Workshop on Compiler
Support for Systems Software (WCSSS '96). [PS]
- Efficient Transparent Application Recovery in Client-Server Information Systems,
by David B. Lomet and G. Weikum, Proc. ACM SIGMOD Conference, Seattle, WA (June 1998), pp.
460-471 [PS]
- Bayou: Replicated Database Services for World-Wide Applications, by Karin
Petersen, Mike J. Spreitzer, Doug B. Terry, and Marvin M. Theimer, Proc. Seventh ACM
SIGOPS European Workshop (EuroSIGOPS '96), Connemara, Ireland, Sept. 1996, pp. 275-280 [PS]
- The CIRRUS banking network, by David Gifford and Alfred Spector,
Communications of the ACM 28(8): 798-807, Aug. 1985 [PDF] [commentary]
--> a good example of how out-of-band mechanisms are employed to robustify the ATM
network
Example of the need to design for failure (the safety mechanism itself failed in a
byzantine way):
- Nov. 16, 1976, Denver Airport: A McDonnell Douglas DC-9-14, crashed after rejecting a
takeoff from runway 8 right at Stapleton International Airport, Denver, Colorado. The
takeoff was rejected when the stall warning stick-shaker activated after the aircraft had
rotated for takeoff. When the pilot was unable to stop the aircraft within the confines of
the runway, it overran the runway, traversed drainage ditches, struck approach light
stanchions, and stopped. Eighty-one passengers and five crewmembers evacuated the
aircraft, which had been damaged severely by impact and fire; 14 persons were injured. The
National Transportation Safety Board determines that the probable cause of this accident
was a malfunction of the stall warning system for undetermined reasons which resulted in a
false stall warning and an unsuccessful attempt to reject the takeoff after the aircraft
had accelerated beyond refusal and rotation speeds. The decision to reject the takeoff,
although not consistent with standard operating procedures and training, was reasonable in
this instant case based upon the unusual circumstances in which the crew found themselves,
the minimal time available for decision, and the crew's judgment concerning a potentially
catastrophic situation.
Do you believe current airplanes are built to protect passengers in the event of
crashes? To what extent? They are quite flimsy. Car manufacturers have started worrying
about such thing only in recent years. In terms of safety, the conventional aircraft is a
disaster when it comes to fuel alone. Fuel is stored in the wings, directly above the
sources of flame (the engines) and under the passengers in the center of the aircraft
(this was the fuel tank which exploded in TWA Flight #800). Some of the newer aircraft
even store fuel in the horizontal tail fins. In case of accident, passengers are
surrounded by fuel - alongside and behind and below (when gravity and the directional
momentum of the aircraft are taken into account, this leads to most passengers being
encircled by burning fuel). On Burnelli aircraft, the fuel tanks are only in the wings,
with no fuel under or behind the passengers. The landing-gear is retracted into the body
which is the main structure of the aircraft. Furthermore, the engines are on the top
rear-most portion of the aircraft, away from the fuel tanks. All of the most volatile
components have been isolated.
Assignment ideas:
- Attempt to demonstrate that using orthogonal approaches to achieve reliability is not
feasible.
Lecture 6: Transactions, ACID vs. BASE, Soft state & announce/listen
Transaction = an exception handling mechanism (if something fails, it gives you the
ability to return where you left from and start again). As an unfortunate consequence,
recursive/nested transactions are problematic (some parts of the trans succeed, others
fail, so you roll back everything, including the successful parts).
Transaction concept as a building block for FT systems; ACID semantics, why they're
hard to get right and even harder to get good performance as well. Why crash
recovery under ACID is hard; write ahead logging, etc. The price of the
transactional abstraction.
Loose coupling with Soft state and announce/listen: design of wide-area Internet
protocols, including route advertisement, multicast routing, RSVP, SRM. Role of soft
state and announce/listen in these systems. Extensions- using these mechanisms at the
application level.
"What's cool about transactions is that the semantics are so simple" (Jim
Gray, personal communication)
Transactions cannot be really viewed as an orthogonal FT mechanism, because they rely
on a multitude of layers below them; the transaction is a the top of pyramid.
- The TWA Reservations System, by David Gifford and Alfred Spector,
Communications of the ACM 27(7) --> good example of trading consistency for
availability and performance
Assignment ideas:
- Certain FT techniques hit a brick wall when the system's complexity increases. Which of
the FT mechanisms we've discussed so far cannot scale to the size/complexity of future
systems? (this is a good segue to the next lecture: self-understanding systems as a
solution to the scalability of traditional FT)
Lecture 7: Introspective Sys, Self-Repairing Sys, Embryonic Sys, Evolvable Sys,
Self-Reconfiguring Sys
The mythical telephone-switch-software daemons that run around and fix up corrupted
data structures [need a ref for this!]. ISTORE and other self-monitoring
systems. If these techniques were more disciplined, woudl there be a place for
this as a first-class design methodology in a software engineering curriculum?
- ISTORE: Introspective Storage for Data-Intensive Network Services, by
Brown, A., D. Oppenheimer, K. Keeton, R. Thomas, J. Kubiatowicz, and D.A. Patterson, Proc.
7th Workshop on Hot Topics in Operating Systems (HotOS-VII), Rio Rico, Arizona, March 1999
[PS]
- Autonet: A high-speed, self-configuring local area network using point-to-point
links, by Michael Schroeder, Andrew D. Birrell, Michael Burrows, Hal Murray, Roger
M. Needham, Thomas L. Rodeheffer, Edwin H. Satterthwaite, Charles P. Thacker, IEEE Journal
on Selected Areas in Communications, 9(8): 1318-1335, Oct. 1991 [PS]
[PDF]
- Fault-tolerant Systems: The way Biology does it!, by Cesar Ortega and Andy
Tyrrell, Proc. Euromicro 97 (Short Contributions), Budapest, IEEE CS Press, Sept. 1997,
pp.146-151 [PDF]
- Computer Know Thy Self!: A Biological Way to Look at Fault-Tolerance, by
Andy Tyrrell, Proc. 2nd Euromicro Workshop on Dependable Computing Systems, Sept. 1999
- Self-Organizing Collaborative Environments, by Balakrishnan, H., Seshan,
S., Bhagwat, P., and Kaashoek, F. Position paper for the NSF/DARPA/NIST Workshop on Smart
Environments, Atlanta, GA, July 1999. [TXT]
- Building an Artificial Brain Using an FPGA Based CAM-Brain Machine, by
Hugo de Garis, Michael Korkin, Felix Gers, Eiji Nawa, Michael Hough, Applied Mathematics
and Computation Journal, Special Issue on "Artificial Life and Robotics, Artificial
Brain, Brain Computing and Brainware", North Holland, 1999 [PS] [PDF]
- Evolvable hardware: The genetic programming of Darwin machines, by Hugo de
Garis, International Conference on Artificial Neural Nets and Genetic Algorithms, 1993,
Innsbruck, Austria [PS] [PDF]
- Baring it all to Software: Raw Machines, by Elliot Waingold, Michael
Taylor, Devabhaktuni Srikrishna, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew
Frank, Peter Finch, Rajeev Barua, Jonathan Babb, Saman Amarasinghe, and Anant Agarwal,
IEEE Computer 30(9):86-93, Sept. 1997 [PS] [PDF]
- Raw Computation, by Anant Agarwal, Scientific American, August 1999 [HTML]
Assignment ideas
:
- Argue against the future of "self-understanding" systems.
-
Lecture 8: Wishful Thinking / Guest Lecture. Wrapup
- Identify how to combine systems techniques w/formal techniques. General strategy:
find out what properties, invariants, etc. are enjoyed by software constructed "the
right way" and see whether the above mechanisms can be used to give these properties
to software written "the wrong way".
- Project issue ??? ( ?)
Assignment ideas:
- 2nd Y2K assigment (to see how much people have learned): how would you retro-fix Y2K?
Unassigned refs:
- The Systematic Improvement of Fault Tolerance in the Rio File Cache, by
Wee Teck Ng and Peter M. Chen, Proc. 1999 Symposium on Fault-Tolerant Computing (FTCS) ,
June 1999 [PS]
- Crash recovery in a distributed data storage system, by Butler W. Lampson,
unpublished technical report, Xerox Palo Alto Research Center, June 1979 [PS] [PDF]
- Jim Gray and Daniel P. Sieworek. High-availability computer systems. Computer 24, 9
(September, 1991) pages 39-48.
- Henry Petroski. Engineering: History and failure. American Scientist 80, 6
(November-December, 1992) pages 523-526.
- "Vision for High-Availability of the Global Infrastructure", panel at the 28th
Annual International Symposium on Fault-Tolerant Computing, 23 - 25 June, 1998, Munich,
Germany (included Ram Chillarege [IBM], Robert Horst, Clifford Meltzer, Angelo Pruscino
[Oracle] and Dalibor F. Vrsalovic)
- "TFT: A Software System for Application-Transparent Fault Tolerance", by
Thomas C. Bressoud. In Twenty-Eighth Annual International Symposium on Fault-Tolerant
Computing, Pp. 128-137, June 1998. [PDF]
- Alfred Spector and David Gifford. The space shuttle primary computer system.
Communications of the ACM 27, 9 (September, 1984) pages 874-900.
- Why cryptosystems fail, by Ross J. Anderson, Communications of the ACM
37(11):32-40, Nov. 1994) [PS] --> ATMs
- Ptolemy/DS1 probe (Barney Pell)
- A computer science perspective of bridge design, by Alfred Spector and
David Gifford, Communications of the ACM, Vol. 29, No. 4 (April 1986), Pages 267-283 [PDF] --> an interesting way to look at other
engineering fields and try to apply the lessons to the computer systems field
- Case study: IBM's system/360-370 architecture, by David Gifford and Alfred
Spector, Communications of the ACM, 30(4): 291-307, Apr. 1987 [PDF]
Homework assignments (currently unsorted and unmatched to topics)
- Discuss some ways to retrofix Y2K?
- Argue some convincing cases in which orthogonal mechanisms are of limited utility.
Programming/Lab assignments
- Robustification library. What could you put in it? How would you express the operations
it can do in such a way that applications can pick their recovery policies? Should it be
kernel code or a user level library? Would it be truly orthogonal (like a sandbox) or
quasi-orthogonal (like a debug-malloc library)?
- Hazard analysis of a "large-component" system, e.g. a system in which the
"components" are things like Apache, Oracle DB server, etc. Follow the same
methodology as for "traditional" failure analysis. How would the analysis change
when glass-box info from the developer of each component is added?
- Simulator that generates "directed random" failures, either as a source of
stimulation or as a SimOS-type library that apps can be run on top of for testing.