12 May 2002
- Thu 10/18: in-class project proposal presentation (~3-4 minutes
each) and debugging
- 11/1 and 11/15: brief project progress meetings (in class)
- 11/22, 11/29: no class meetings (use the time to work on your projects)
- 12/6: Project conference talks (in class)
- Friday 12/7: Project poster/demo session at Stanford, followed by
TGIF (happy hour)
- 12/10: conference-quality papers due
Project Web Pages
Project Proposal Requirements
Your proposal must address the following issues - preferably
as bullet items or subheadings. One person from each group should
be prepared to potentially present the proposal in class, using one PPT
or transparency or just talking for 3-4 minutes max:
- What is the interesting research question/problem area being
addressed? (Be specific.) Ideally this should be framed in the
form of a hypothesis: "We stipulate that the following is the
case..." - this can refer to measurements, the possibility that
something can be designed/built, etc. Your project will then
attempt to support or weaken that hypothesis.
- What is the approach/strategy (again, be specific... will you
measure something? what, and where? will you build a prototype in
order to measure it? can you convince us you have enough time?)
- Who has done similar things, and why is this an improvement or how
does it fill a gap? (i.e. you should definitely do a preliminary but
broad literature search before starting the project)
- What deliverables are expected (data collection + analysis,
working prototype, etc)
We will still help people debug proposals and where possible refine
the proposed approach, etc. We reserve the right to assign projects
if people can't come up with well-circumscribed proposals that
satisfactorily address the above points.
Existing projects looking for more people:
None yet. If you have a project idea and are looking for a partner,
send a one-paragraph description to Aaron Brown and it will be
posted here.
Suggestions for general project themes:
These are some general themes that might make interesting projects in
the area of Recovery-Oriented Computing and system dependability. Most
of these themes will have to be narrowed down significantly to make
- Virtual machines for dependability. We've had several
discussions in the class about system virtualization techniques and
the ways that they might contribute to dependability, for example by
isolating applications from each other, by providing for more
efficient failover, by providing an environment for
fault-injection-based robustness verification, and by providing a
means of intercepting and analyzing low-level resource requests to
detect abnormal behavior. Possible projects might involve selecting
one or more of these ideas and implementing/evaluating them in the
context of a real virtual machine environment.
- Dependability metrics and benchmarking techniques. A key
component to the success of the ROC project (and indeed of any
research in Internet systems dependability) lies in quantifying the
dependability gains achieved through ROC techniques. This requires
representative, reproducible, and widely-applicable metrics, as well
as techniques for measuring them. There are several possible
projects in this area, including following up on the earlier work
by members of the ROC project and the IFIP Dependability
Benchmarking working group, and devising dependability metrics
and benchmarks for systems that haven't yet been analyzed for
- Human error, maintainability, and dependability. A
theme we've brought up throughout the course is the importance of
human beings in system dependability. There has been a lot of work
on human factors in safety-critical systems, but very little
consideration of humans in the administration/maintenance side of
Internet-service-type systems. Possible projects in this area might
include developing and carrying out pilot human studies with
existing systems in order to quantify the kinds of mistakes people
make while administering them, attempting to build models of
administrator error that could be used to drive automated
maintainability benchmarks, or developing new administration models
that are more forgiving of human error.
- Case studies. A major problem plaguing dependability
research is the lack of data on real system failures. A valuable
project would be to take a real, deployed system and perform a case
study on its failure modes and their root causes. This could be done
in cooperation with a commercial site (if access to data is
available) or by studying particular systems at Stanford or Berkeley
(such as Berkeley's long-suffering NetApp filer), among other
- Fault simulation and injection. Several of the techniques
proposed for ROC systems rely on a means of simulating or generating
realistic system failures. Unfortunately, most existing work on
failure simulation has focused on very low-level hardware
fault-injection (i.e., flipping bits in the processor or
memory). What's needed is a system for injecting more realistic
high-level faults, capturing the failure modes of peripherals (like
hard drives), networks, device drivers, and software components.
Especially useful would be a software framework for
generating/simulating these failures--this could be used for
dependability benchmarking as well as ROC's proactive verification
techniques. One or more projects could be defined to develop all or
part of such a framework, perhaps in the context of a virtual
- ROC design techniques. The ROC project has identified many
possible system design techniques that might increase dependability
(see Aaron's quals talk or the ROC overview talk for a list).
These are the source of many possible
projects: pick an interesting technique and implement it in the
system of your choice, then evaluate that system for dependability
- Ground-station related projects. <insert description>
- [2 unit option] Thorough survey paper on systems dependability
techniques. As we have lamented several times in lecture, there
are no good all-encompassing survey papers on dependability
techniques at a systems level. Writing one would be a great 2-unit
project and would also be a great service to the community.
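As a rough illustration of the fault simulation and injection theme above, here is a minimal Python sketch of a software-level injector that makes a wrapped operation fail at a configurable rate. It is only a starting point under stated assumptions: the names `FaultInjector` and `InjectedFault` are hypothetical, the only failure mode modeled is an I/O-style error, and a real framework would need per-component fault models (disks, networks, drivers, software components).

```python
import random


class InjectedFault(IOError):
    """Raised in place of a real failure (hypothetical name)."""


class FaultInjector:
    """Wraps callables so that they fail with a given probability."""

    def __init__(self, failure_rate, seed=0):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        # A seeded generator keeps fault sequences reproducible across runs,
        # which matters if the injector is used for benchmarking.
        self._rng = random.Random(seed)
        self.calls = 0
        self.injected = 0

    def wrap(self, operation):
        """Return a version of operation that sometimes raises InjectedFault."""
        def wrapper(*args, **kwargs):
            self.calls += 1
            if self._rng.random() < self.failure_rate:
                self.injected += 1
                raise InjectedFault("injected fault in " + operation.__name__)
            return operation(*args, **kwargs)
        return wrapper
```

A component under test would call the wrapped operation instead of the real one, and the `calls` and `injected` counters give the denominator and numerator for simple robustness statistics.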
Specific project suggestions:
The following are suggestions for specific projects that have been
proposed by the class staff and industry experts connected to the ROC
- [Brendan Murphy] A project that springs to mind is
developing tools to measure the availability of a web site. This
appears to be relatively easy, but you have to take into
consideration how to handle delayed responses (i.e., what is your
definition of "up"? Is it a response within 5 seconds? And if you need
a retry, does that mean the site is degraded?). As your course is a
joint Stanford/Berkeley course, you have the opportunity to monitor the
availability of a web site internally and externally (i.e., monitor a
Berkeley web site from within your firewall and also from Stanford,
outside your firewall) and compare the results. The classic way of
doing this is to have a web page with a defined response (so the
monitor issues a defined URL and validates the complete response;
note that simply pinging is not sufficient) so you can check that you
receive a response and that the response is the complete page.
Setting up the project requires people to think through the
problem, but it is unlikely that in 3 weeks they will get any data.
BUT if you leave it running, it allows follow-on projects to try to
come up with availability and reliability measurements.
- [Jim Gray] Benchmarking restart times... Yes, you can measure OS,
spooler, RAID, cluster failover, FileSystem, DBMS, WebServer,......
You can measure MTTR (very important metric) and measure the tail
(needs human intervention) and measure success rate (what % of data
and work lost). Since unavailability is MTTR/MTTF and since we do not
know how to make MTTF infinite, we want to try to drive MTTR to
zero. This is about measuring how close to zero we are, and what the
distribution looks like.
- [James Cutler] Apply ROC to the Federated Network of Ground
Stations (FNG) being deployed by Stanford's Space Systems
Development Lab. Specifically, apply three kinds of
monitoring--component checks, end-to-end checks, and
cross-checks--to monitor performance of one or more FNG nodes and
detect failures. (This problem is potentially interesting
because it involves the use of separate sites for cross-checking to
detect a failure; i.e. a failure at site A can be confirmed/inferred
by looking at data from site B. This idea should also be
applicable to replicated Web services.) Goal of project would
be to extract lessons from this case study that could form basis of
a methodology for applying ROC.
- [George Candea] Apply ROC to Stanford Interactive Workspace; same
general principles as the FNG project above (#3).
- [Aaron Brown and Jim Gray] Database availability benchmarking.
This project would be a follow-on to earlier work by Aaron on using
fault-injection to quantify the availability of a three-tier
database (see this paper). The earlier work only looked at disk
faults in the
database and end-user-perceived performance as a measure of
availability. It would be very interesting to extend this work to
include the injection of faults into more components of the system
(including the middleware and front-end), and to look at more
detailed dependability metrics for the system.
- [Aaron Brown] Conservation laws and dependability. Physical
systems (those that deal with tangible entities like fluids, gasses,
etc.) behave according to the laws of physics. In particular, there
are physical conservation laws that make it possible to detect
complex, hidden problems through simple analyses of material flow,
production, and destruction. For example, a leak in a power plant
can be detected by monitoring the coolant flow at various points and
noticing unexpected differences.
An interesting question is whether similar conservation-law-analysis
techniques could be applied to computer systems by treating data as
a conserved entity. Are there conservation laws that make sense for
data in Internet systems? It seems that the answer is yes--for
example, a piece of incoming email is a set of data that should be
conserved as it traverses the system from SMTP socket to mail router
to storage to IMAP server, with other conservation laws influencing
headers and MIME parts.
This project would investigate the utility of conservation laws
for enhancing system dependability by detecting latent errors
and problems before they become significant. You would pick a
system, devise conservation laws, implement monitors to detect
non-conservation, then demonstrate that your system detects
- [Aaron Brown] Using hacker techniques to improve maintainability.
System administrators bemoan the lack of tools to effectively
administer and maintain large clusters of machines, especially for
Windows machines. Hackers, however, seem to have no trouble
harnessing and coordinating huge and far-flung groups of machines to
perform denial of service attacks. A possible project is to study
the techniques used by hackers, worms, viruses, etc., and figure out
if and how they can be co-opted for more productive uses,
particularly in improving dependability or maintainability.
- [Pedram Keyani, Brian Larson, Muthu Senthil] Failures in P2P
systems: recovering from malicious or malfunctioning nodes and worms,
and defining and maintaining metrics for failures in distributed
systems. The term "failure" has not been fully defined for pure P2P
systems yet. We feel it is important to define metrics to describe
availability of P2P systems. In such a non-cooperative environment,
what is important to each node is that its individual needs are
satisfied. With this in mind we want to explore different ways in
which local failure policies can be applied to benefit the entire
system. Nodes should have the capability to collect and maintain
metrics about the behavior of their neighbors and react accordingly.
Metrics could include such things as number of files shared,
bandwidth, queries handled, average time online and other metrics
which we are still discussing.
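To make the web-site availability suggestion above more concrete, the following Python sketch shows one way a monitor could classify each probe. It is a sketch under assumptions, not a recommended definition of "up": the 5-second threshold, the rule that any retry means "degraded", and the function names `classify_probe`, `probe`, and `summarize` are all illustrative choices, and the expected page body is supplied by whoever sets up the test page.

```python
import http.client
import time
import urllib.request
from collections import Counter


def classify_probe(body, elapsed_s, attempts, expected_body, slow_after_s=5.0):
    """Classify one probe as 'up', 'degraded', or 'down' (assumed policy)."""
    # A missing, truncated, or altered page counts as down: knowing that
    # something answered (as with ping) is not enough.
    if body is None or body != expected_body:
        return "down"
    # A complete page that was slow or needed a retry counts as degraded.
    if elapsed_s > slow_after_s or attempts > 1:
        return "degraded"
    return "up"


def probe(url, expected_body, max_attempts=2, timeout_s=10.0):
    """Fetch url, retrying on errors, and classify the outcome."""
    start = time.monotonic()
    body, attempts = None, 0
    while body is None and attempts < max_attempts:
        attempts += 1
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as response:
                body = response.read()
        except (OSError, http.client.HTTPException):
            body = None  # connection error, timeout, HTTP error, or short read
    elapsed_s = time.monotonic() - start
    return classify_probe(body, elapsed_s, attempts, expected_body)


def summarize(states):
    """Return the fraction of probes observed in each state."""
    if not states:
        return {}
    counts = Counter(states)
    return {s: counts[s] / len(states) for s in ("up", "degraded", "down")}
```

Running `probe` on a schedule from both inside and outside the firewall, and comparing the two `summarize` results, would give the internal/external comparison described in the suggestion. Keeping the policy in `classify_probe` separate from the network code makes it easy to try alternative definitions of "up" against the same recorded data.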
Some ideas from past iterations:
- Robustification library. What could you put in it? How would you express the operations
it can do in such a way that applications can pick their recovery policies? Should it be
kernel code or a user level library? Would it be truly orthogonal (like a sandbox) or
quasi-orthogonal (like a debug-malloc library)? For some good ideas about this,
(re-)read Dawson Engler's paper on
Interface Compilation. One example along these lines: most Unix apps don't
error-check the results of system calls (fopen, malloc, etc.). Annotate program
blocks to indicate how to recover in these cases, in case the programs don't have their
own checking/recovery code. Use a tool like Janus to intercept the system calls and
implement the behaviors. (As another motivation, Janus often protects against unsafe
behavior by causing a system call to fail. A robustification library might allow
this to occur and still have the program keep running in a nice way.)
- Simulator that generates "directed random" failures, either as a source of
stimulation or as a SimOS-type library that apps can be run on top of for testing.
(Alternatively, find one and use it for the project.)
- Generic, templatable highly-available soft-state server with well-defined failure and
- Implementing global assertions. A set of redundant mechanisms -- within each level
of abstraction, and at different levels of the implementation -- that continuously check
global assertions, and provision for calling (one of) a set of "assertion failed
handlers", possibly with additional state as arguments, when the assertions fail.
Could this be done using polling? Interrupts/notification? How could existing
programming tools be augmented to include support for global assertion checking?
- Daemon robustifier. Parameterized monitoring, and parameterized restart/reinit,
for daemons. End-to-end operational checks, restarts with limit count, etc.
End-to-end check must verify liveness and correct operation, not just "upness"
in the ps sense. Restartability must be reasonable in the face of several
conditions: (a) Maybe the problem is really elsewhere, like in the network. In that
case you must limit the number of restarts so you don't consume resources forever.
(b) Maybe the problem is in some other, lower-level component, e.g. maybe the network
interface isn't working at all, so all network programs will appear to be broken. In
this case, could the lower-level component(s) be recursively protected?
etc. Similarly, restarting may have some constraints. For example, pppd requires
access to the serial port, so starting one when a wedged one is running won't work - the
wedged one must be killed completely or no restarts will work. So you'd need a way
to verify that something is really gone. The tricky case involves processes that use
TCP connections: even after the process has been killed, the connection it used may be
stuck in the FIN_WAIT state (a well known problem) preventing any processes from binding to
that TCP port until after a 3-minute (?) timeout has passed. The recovery/monitoring
code must be aware of circumstances like this in order to act intelligently. If it
can do nothing, it should try to notify a human via email or pager (how can it check that
this worked?). Bonus question: who watches the watcher? i.e. what if the
daemon-watcher itself fails? Could you run two of them and have them watch each
other, as was done with process peers in Berkeley SNS/TACC?
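The restart-limiting part of the daemon robustifier idea can be sketched in a few lines of Python. This is an illustrative sketch only: `DaemonWatcher` and its parameters are hypothetical names, and the end-to-end check, the restart action, and the human notification are passed in as callables because they depend entirely on the daemon being watched. It does not address verifying that a wedged process is really gone, or who watches the watcher.

```python
import time


class DaemonWatcher:
    """Restarts a daemon when its check fails, up to a limit per time window."""

    def __init__(self, check, restart, notify,
                 max_restarts=3, window_s=600.0, clock=time.monotonic):
        self.check = check        # returns True if the daemon works end to end
        self.restart = restart    # kills and restarts the daemon
        self.notify = notify      # tells a human, e.g. by email or pager
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.clock = clock        # injectable so tests can control time
        self._restart_times = []

    def poll(self):
        """Run one check; return 'ok', 'restarted', or 'gave_up'."""
        if self.check():
            return "ok"
        now = self.clock()
        # Forget restarts that fell out of the sliding window.
        self._restart_times = [t for t in self._restart_times
                               if now - t < self.window_s]
        if len(self._restart_times) >= self.max_restarts:
            # Restarting is not helping; the fault may be elsewhere
            # (network, lower-level component), so stop and escalate.
            self.notify("restart limit reached; manual intervention needed")
            return "gave_up"
        self._restart_times.append(now)
        self.restart()
        return "restarted"
```

A caller would invoke `poll()` periodically. Using a sliding window rather than a lifetime counter lets the watcher resume restarting once the underlying problem has had time to clear.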
Current enrollment list:
George Candea (staff)
Aaron Brown (staff)
Eventually the above table will morph into a list of project teams.
Evaluation Criteria and
- Dec. 10 Conference-quality project report. We expect
some of them to be submitted as conference papers.
- Dec. 6 (in class) Short (15-min.) project talk.
- Dec. 7 (at Stanford) Demo/poster session followed by free
Links to previous projects