Stanford CS444A / Berkeley CS294-4
Recovery-Oriented Computing
Fall 2001


Project Information updated 12 May 2002

Timeline (tentative)

  • Thu 10/18: in-class project proposal presentations (~3-4 minutes each), with time for proposal debugging
  • 11/1 and 11/15: brief project progress meetings (in class)
  • 11/22, 11/29: no class meetings (use the time to work on projects)
  • 12/6: Project conference talks (in class)
  • Friday 12/7: Project poster/demo session at Stanford, followed by TGIF (happy hour)
  • 12/10: conference-quality papers due

Project Web Pages

Project Proposal Requirements

Your proposal must address the following issues, preferably as bullet items or subheadings. One person from each group should be prepared to present the proposal in class in 3-4 minutes at most, using a single PPT slide or transparency, or just talking:

  1. What is the interesting research question/problem area being addressed? (Be specific.) Ideally this should be framed in the form of a hypothesis: "We stipulate that the following is the case..." - this can refer to measurements, the possibility that something can be designed/built, etc. Your project will then attempt to support or weaken that hypothesis.
  2. What is the approach/strategy (again, be specific... will you measure something? what, and where? will you build a prototype in order to measure it? can you convince us you have enough time?)
  3. Who has done similar things, and why is this an improvement or how does it fill a gap? (i.e. you should definitely do a preliminary but broad literature search before starting the project)
  4. What deliverables are expected (data collection + analysis, working prototype, etc)

We will still help people debug proposals and, where possible, refine the proposed approach. We reserve the right to assign projects if people can't come up with well-circumscribed proposals that satisfactorily address the above points.

Existing projects looking for more people:

None yet. If you have a project idea and are looking for a partner, send a one-paragraph description to Aaron Brown and it will be posted here.

Suggestions for general project themes:

These are some general themes that might make interesting projects in the area of Recovery-Oriented Computing and system dependability. Most of these themes will have to be narrowed down significantly to make practical projects.

  • Virtual machines for dependability. We've had several discussions in the class about system virtualization techniques and the ways that they might contribute to dependability, for example by isolating applications from each other, by providing for more efficient failover, by providing an environment for fault-injection-based robustness verification, and by providing a means of intercepting and analyzing low-level resource requests to detect abnormal behavior. Possible projects might involve selecting one or more of these ideas and implementing/evaluating them in the context of a real virtual machine environment.
  • Dependability metrics and benchmarking techniques. A key component to the success of the ROC project (and indeed of any research in Internet systems dependability) lies in quantifying the dependability gains achieved through ROC techniques. This requires representative, reproducible, and widely-applicable metrics, as well as techniques for measuring them. There are several possible projects in this area, including following up on the earlier work in this area by members of the ROC project and the IFIP Dependability Benchmarking working group, and in devising dependability metrics and benchmarks for systems that haven't yet been analyzed for dependability.
  • Human error, maintainability, and dependability. A theme we've brought up throughout the course is the importance of human beings in system dependability. There has been a lot of work on human factors in safety-critical systems, but very little consideration of humans in the administration/maintenance side of Internet-service-type systems. Possible projects in this area might include developing and carrying out pilot human studies with existing systems in order to quantify the kinds of mistakes people make while administering them, attempting to build models of administrator error that could be used to drive automated maintainability benchmarks, or developing new administration models that are more forgiving of human error.
  • Case studies. A major problem plaguing dependability research is the lack of data on real system failures. A valuable project would be to take a real, deployed system and perform a case study on its failure modes and their root causes. This could be done in cooperation with a commercial site (if access to data is available) or by studying particular systems at Stanford or Berkeley (such as Berkeley's long-suffering NetApp filer), among other possibilities.
  • Fault simulation and injection. Several of the techniques proposed for ROC systems rely on a means of simulating or generating realistic system failures. Unfortunately, most existing work on failure simulation has focused on very low-level hardware fault-injection (i.e., flipping bits in the processor or memory). What's needed is a system for injecting more realistic high-level faults, capturing the failure modes of peripherals (like hard drives), networks, device drivers, and software components. Especially useful would be a software framework for generating/simulating these failures--this could be used for dependability benchmarking as well as ROC's proactive verification techniques. One or more projects could be defined to develop all or part of such a framework, perhaps in the context of a virtual machine environment.
  • ROC design techniques. The ROC project has identified many possible system design techniques that might increase dependability (see Aaron's quals talk or the ROC overview talk for a list). These are the source of many possible projects: pick an interesting technique and implement it in the system of your choice, then evaluate that system for dependability improvements.  
  • Ground-station related projects. <insert description>
  • [2 unit option] Thorough survey paper on systems dependability techniques. As we have lamented several times in lecture, there are no good all-encompassing survey papers on dependability techniques at a systems level. Writing one would be a great 2-unit project and would also be a great service to the community.

Specific project suggestions:

The following are suggestions for specific projects that have been proposed by the class staff and industry experts connected to the ROC project:

  1. [Brendan Murphy]  A project that springs to mind is developing tools to measure the availability of a web site. This appears relatively easy, but you have to decide how to handle delayed responses (i.e., what is your definition of "up"? Is it a response within 5 seconds? If a retry is needed, does that mean the site is degraded?). Since this is a joint Stanford/Berkeley course, you have the opportunity to monitor the availability of a web site both internally and externally (e.g., monitor a Berkeley web site from within its firewall and also from Stanford, outside the firewall) and compare the results. The classic way of doing this is to serve a web page with a defined response: the monitor fetches a defined URL and validates the complete response, checking both that a response arrives and that it is the total page (note that simply pinging is not sufficient). Setting up the project requires thinking the problem through carefully, but it is unlikely that any data will accumulate in 3 weeks; if you leave the monitor running, however, follow-on projects can use it to derive availability and reliability measurements.
  2. [Jim Gray] Benchmarking restart times... Yes, you can measure the OS, spooler, RAID, cluster failover, file system, DBMS, web server, and so on. You can measure MTTR (a very important metric), measure the tail of the distribution (the cases needing human intervention), and measure the success rate (what % of data and work is lost). Since unavailability is MTTR/MTTF, and since we do not know how to make MTTF infinite, we want to try to drive MTTR to zero. This is about measuring how close to zero we are, and what the distribution looks like.
  3. [James Cutler] Apply ROC to the Federated Network of Ground Stations (FNG) being deployed by Stanford's Space Systems Development Lab.  Specifically, apply three kinds of monitoring--component checks, end-to-end checks, and cross-checks--to monitor performance of one or more FNG nodes and detect failures.  (This problem is potentially interesting because it involves the use of separate sites for cross-checking to detect a failure; i.e. a failure at site A can be confirmed/inferred by looking at data from site B.  This idea should also be applicable to replicated Web services.)  Goal of project would be to extract lessons from this case study that could form basis of a methodology for applying ROC.
  4. [George Candea] Apply ROC to Stanford Interactive Workspace; same general principles as #3.
  5. [Aaron Brown and Jim Gray] Database availability benchmarking. This project would be a follow-on to earlier work by Aaron on using fault-injection to quantify the availability of a three-tier database (see this paper). The earlier work only looked at disk faults in the database and end-user-perceived performance as a measure of availability. It would be very interesting to extend this work to include the injection of faults into more components of the system (including the middleware and front-end), and to look at more detailed dependability metrics for the system.
  6. [Aaron Brown] Conservation laws and dependability. Physical systems (those that deal with tangible entities like fluids, gasses, etc.) behave according to the laws of physics. In particular, there are physical conservation laws that make it possible to detect complex, hidden problems through simple analyses of material flow, production, and destruction. For example, a leak in a power plant can be detected by monitoring the coolant flow at various points and noticing unexpected differences.
    An interesting question is whether similar conservation-law-analysis techniques could be applied to computer systems by treating data as a conserved entity. Are there conservation laws that make sense for data in Internet systems? It seems that the answer is yes--for example, a piece of incoming email is a set of data that should be conserved as it traverses the system from SMTP socket to mail router to storage to IMAP server, with other conservation laws influencing headers and MIME parts.
    This project would investigate the utility of conservation laws for  enhancing system dependability by detecting latent errors and problems before they become significant. You would pick a system, devise conservation laws, implement monitors to detect non-conservation, then demonstrate that your system detects simulated problems.
  7. [Aaron Brown] Using hacker techniques to improve maintainability. System administrators bemoan the lack of tools to effectively administer and maintain large clusters of machines, especially for Windows machines. Hackers, however, seem to have no trouble harnessing and coordinating huge and far-flung groups of machines to perform denial of service attacks. A possible project is to study the techniques used by hackers, worms, viruses, etc., and figure out if and how they can be co-opted for more productive uses, particularly in improving dependability or maintainability.
  8. [Pedram Keyani, Brian Larson, Muthu Senthil] Failures in P2P systems: recovering from malicious or malfunctioning nodes and worms, and defining and maintaining failure metrics for distributed systems. The term "failure" has not yet been fully defined for pure P2P systems. We feel it is important to define metrics to describe the availability of P2P systems. In such a non-cooperative environment, what matters to each node is that its individual needs are satisfied. With this in mind, we want to explore different ways in which local failure policies can be applied to benefit the entire system. Nodes should have the capability to collect and maintain metrics about the behavior of their neighbors and react accordingly. Metrics could include the number of files shared, bandwidth, queries handled, average time online, and others that we are still discussing.
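The up/degraded/down distinction in project 1 can be sketched concretely. The sketch below is illustrative only: the function names, the 5-second threshold, and the exact-match validation of the page body are assumptions, chosen to show the checks a bare ping would miss.

```python
import time
import urllib.request

def classify_probe(body, expected_body, elapsed, up_threshold=5.0):
    """Classify one probe of the monitored page.

    "up" only if the complete expected page arrived within the threshold;
    a correct but slow answer counts as "degraded". The 5-second
    threshold is one possible definition of "up", per the project text.
    """
    if body != expected_body:
        return "down"        # truncated or wrong page counts as down
    if elapsed > up_threshold:
        return "degraded"    # right answer, but too slow to call "up"
    return "up"

def probe(url, expected_body, timeout=30.0):
    """Fetch the defined URL and classify the result."""
    start = time.monotonic()
    try:
        body = urllib.request.urlopen(url, timeout=timeout).read().decode()
    except OSError:
        return "down"        # no (complete) response at all
    return classify_probe(body, expected_body, time.monotonic() - start)
```

Running this same probe from inside and outside the firewall, against the same defined page, gives the internal-versus-external comparison the project asks for.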
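The arithmetic behind project 2 is worth writing out. A minimal sketch (the function names are my own):

```python
def unavailability(mttr, mttf):
    """Fraction of time the system is down.

    The exact form is MTTR / (MTTF + MTTR); when MTTF >> MTTR this is
    approximately MTTR / MTTF, the form quoted above. Units only need
    to match (e.g. hours for both).
    """
    return mttr / (mttf + mttr)

def downtime_minutes_per_year(mttr, mttf):
    """Annual downtime implied by the ratio above."""
    return unavailability(mttr, mttf) * 365 * 24 * 60
```

For example, an MTTR of 1 hour against an MTTF of 999 hours gives unavailability 0.001, or about 8.8 hours of downtime per year; halving MTTR halves that figure without touching MTTF, which is the point of driving MTTR toward zero.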
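The email example in project 6 suggests what a conservation monitor might look like. This sketch (the class name, event names, and pipeline stages are all assumed for illustration) tallies messages at each stage and reports the balance:

```python
class ConservationMonitor:
    """Check a conservation law over a mail-style pipeline: every message
    that enters must end up stored, delivered, or explicitly rejected.
    A persistent nonzero balance is a latent error (a "leak") detected
    without any knowledge of the pipeline's internals.
    """
    SINKS = ("stored", "delivered", "rejected")

    def __init__(self):
        self.counts = {"in": 0, "stored": 0, "delivered": 0, "rejected": 0}

    def record(self, event, n=1):
        """Called by instrumentation at each pipeline stage."""
        self.counts[event] += n

    def leaked(self):
        """Messages that entered but never reached a legitimate sink."""
        return self.counts["in"] - sum(self.counts[s] for s in self.SINKS)
```

In a real deployment the balance is only expected to be zero once in-flight messages quiesce, so the monitor would alarm on a balance that stays nonzero over time, not on any instantaneous reading.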

Some ideas from past iterations:

  1. Robustification library. What could you put in it? How would you express the operations it can do in such a way that applications can pick their recovery policies? Should it be kernel code or a user level library? Would it be truly orthogonal (like a sandbox) or quasi-orthogonal (like a debug-malloc library)?  For some good ideas about this, (re-)read Dawson Engler's paper on Interface Compilation.  One example along these lines: most Unix apps don't error-check the results of system calls (fopen, malloc, etc.).  Annotate program blocks to indicate how to recover in these cases, in case the programs don't have their own checking/recovery code.  Use a tool like Janus to intercept the system calls and implement the behaviors.  (As another motivation, Janus often protects against unsafe behavior by causing a system call to fail.  A robustification library might allow this to occur and still have the program keep running in a nice way.)
  2. Simulator that generates "directed random" failures, either as a source of stimulation or as a SimOS-type library that apps can be run on top of for testing.   (Alternatively, find one and use it for the project.)
  3. Generic, templatable highly-available soft-state server with well-defined failure and weak-consistency semantics.
  4. Implementing global assertions.  A set of redundant mechanisms -- within each level of abstraction, and at different levels of the implementation -- that continuously check global assertions, and provision for calling (one of) a set of "assertion failed handlers", possibly with additional state as arguments, when the assertions fail.   Could this be done using polling? Interrupts/notification?  How could existing programming tools be augmented to include support for global assertion checking?
  5. Daemon robustifier.  Parameterized monitoring, and parameterized restart/reinit, for daemons.  End-to-end operational checks, restarts with limit count, etc.   End-to-end check must verify liveness and correct operation, not just "upness" in the ps sense.  Restartability must be reasonable in the face of several conditions: (a) maybe the problem is really elsewhere, like in the network.  In this case you must limit the number of restarts so you don't consume resources forever.  (b) Maybe the problem is in some other, lower-level component, e.g. maybe the network interface isn't working at all, so all network programs will appear to be broken.  In this case, could the lower-level component(s) be recursively protected?  etc.  Similarly, restarting may have some constraints.  For example, pppd requires access to the serial port, so starting one when a wedged one is running won't work - the wedged one must be killed completely or no restarts will work.  So you'd need a way to verify that something is really gone.  The tricky case involves processes that use TCP connections: even after the process has been killed, the connection it used may be stuck in the TIME_WAIT state (a well-known problem), preventing any processes from binding to that TCP port until a timeout of twice the maximum segment lifetime (typically a few minutes) has passed, unless SO_REUSEADDR is used.  The recovery/monitoring code must be aware of circumstances like this in order to act intelligently.  If it can do nothing, it should try to notify a human via email or pager (how can it check that this worked?).  Bonus question: who watches the watcher?  i.e. what if the daemon-watcher itself fails?  Could you run two of them and have them watch each other, as was done with process peers in Berkeley SNS/TACC?
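Item 1's idea of annotating blocks with recovery policies might look like the following sketch. A real robustification library would interpose at the system-call layer (e.g. with a tool like Janus); this decorator only illustrates the policy side, and every name in it is hypothetical:

```python
import time

def with_recovery(retries=3, delay=0.0, fallback=None):
    """Attach a recovery policy to a call: retry a bounded number of
    times on OS-level failure, then return a fallback value rather than
    crash. The policy (retries, delay, fallback) is the annotation the
    application author would supply per block.
    """
    def wrap(fn):
        def robust(*args, **kwargs):
            for _ in range(retries):
                try:
                    return fn(*args, **kwargs)
                except OSError:
                    time.sleep(delay)   # back off before retrying
            return fallback             # recovery policy: degrade, don't die
        return robust
    return wrap
```

For instance, a config-file read could be annotated with `@with_recovery(retries=2, fallback="")` so that a transient I/O failure yields empty defaults instead of an unchecked crash, mirroring the "keep the program running in a nice way" goal above.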
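The polling flavor of item 4's global assertion checking could be prototyped as below. The class and method names are assumptions; an interrupt/notification design would hook state mutations instead of polling on a timer:

```python
class AssertionMonitor:
    """Polling-based global assertion checker. Each assertion is a
    predicate over shared state; on failure, the matching "assertion
    failed handler" is called with that state, per item 4 above.
    """
    def __init__(self):
        self.checks = []   # list of (name, predicate, handler)

    def register(self, name, predicate, handler):
        self.checks.append((name, predicate, handler))

    def poll(self, state):
        """One polling pass; returns the names of failed assertions."""
        failed = []
        for name, predicate, handler in self.checks:
            if not predicate(state):
                handler(name, state)   # invoke the failure handler
                failed.append(name)
        return failed
```

Checking the same invariant at different levels of the implementation would then just mean registering redundant predicates that inspect different representations of the same state.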
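Item 5's restart-with-limit-count logic can be sketched as a single supervision routine. All names here are hypothetical: `check` stands in for the end-to-end liveness test, and `restart` is assumed to first verify the wedged instance is truly gone, per the pppd and TCP-port discussion above:

```python
import time

def supervise(check, restart, max_restarts=3, backoff=0.0):
    """One supervision round for a daemon: run the end-to-end check,
    restart on failure, and give up after a bounded number of restarts
    (the fault may really be elsewhere, e.g. in the network, so
    restarting forever would just consume resources).
    """
    for attempt in range(max_restarts + 1):
        if check():
            return ("healthy", attempt)   # attempt = restarts it took
        if attempt < max_restarts:
            restart()
            time.sleep(backoff)
    return ("escalate", max_restarts)     # restarts exhausted: page a human
```

The "escalate" outcome is where the email/pager notification (and the question of verifying that the notification worked) would hang off; the watcher-of-the-watcher question is then whether two `supervise` loops can run each other's `check`.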
Project Teams

Current enrollment list:

George Candea (staff)
Aaron Brown (staff)
Emre Kiciman
David Oppenheimer
Jamie Cutler
Pedram Keyani

Eventually the above table will morph into a list of project teams.

Evaluation Criteria and Deliverables
  1. Dec. 10: Conference-quality project report.  We expect some of these to be submitted as conference papers.
  2. Dec. 6 (in class): Short (15-min.) project talk.
  3. Dec. 7 (at Stanford): Demo/poster session followed by free beer.
Links to previous projects
 
