The Recovery-Oriented Computing (ROC) project is a joint Berkeley/Stanford
research project that is investigating novel techniques for building highly-dependable
Internet services. In a significant divergence from traditional
fault-tolerance approaches, ROC emphasizes recovery from failures rather
than failure-avoidance. This philosophy is motivated by the observation that
even the most robust systems still occasionally encounter failures due to human
operator error, transient or permanent hardware failure, and software anomalies
resulting from "Heisenbugs" or software aging.
The ROC approach takes the following three assumptions as its basic tenets:
- failure rates of both software and hardware are non-negligible
- systems cannot be completely modeled for reliability analysis, and thus
their failure modes cannot be predicted in advance
- human error by system operators and during system maintenance is a major
source of system failures
These assumptions, while running counter to most existing work in dependable
and fault-tolerant systems, are all strongly supported by field evidence from
modern production Internet service environments.
ROC Research Areas
The assumptions listed above provide a broad philosophy for guiding the
design of ROC systems. From this philosophy, we have identified several more
concrete research areas that fall under the ROC umbrella. Each of these areas
defines one of the important qualities that must be provided by a truly
recovery-oriented computing system.
- Isolation and Redundancy. A key part of any recovery-oriented
computing system is the ability to isolate portions of the system. Isolation
is crucial for fault containment and safe online recovery, and it is an
enabler for many of the diagnostic and verification techniques described
below. Isolation naturally demands redundancy, as redundancy allows
continued service delivery while portions of the system are isolated.
Isolation and redundancy in ROC systems must go beyond traditional
approaches. Because the ROC philosophy assumes that obscure and unexpected
faults may occur, isolation must be failure-proof under a broad failure
model, including all software and human-induced failures.
We are investigating hardware support for isolation as well as robust
software isolation using virtual machine monitors, software sandboxing,
and hardware-assisted sandboxing (e.g., using MMU hardware).
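As a minimal illustration of fault containment through isolation plus redundancy (an assumed sketch, not an ROC implementation), the following runs a component in a separate OS process so that a crash is contained, then falls back to a redundant replica:

```python
# Hedged sketch: isolate a component in its own interpreter process so a
# crash cannot take down the caller, and serve from a redundant replica
# when the isolated component fails. All names here are illustrative.
import subprocess
import sys

def run_isolated(primary_code, replica_code):
    """Run primary_code in a separate process; on any crash, fall back
    to replica_code (redundancy enables continued service)."""
    for snippet in (primary_code, replica_code):
        result = subprocess.run([sys.executable, "-c", snippet],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()
    return None                      # both the component and its replica failed

primary = "import os; os.abort()"    # simulated component crash
replica = "print(21 * 2)"            # redundant replica serves the request
print(run_isolated(primary, replica))   # → 42
```

Process boundaries are only one point in the design space sketched above; virtual machine monitors and MMU-assisted sandboxing provide stronger containment at higher cost.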
- System-wide support for Undo. Data and experience show that human
error is the largest single cause of failures and outages in modern server
systems, and results from psychology research point out that human error is
intrinsic and can never be completely eliminated. Most productivity
applications recognize the inevitability of human error and provide undo facilities
that allow the human user to recover from their errors. However, such
facilities are rarely if ever provided for system maintenance: system
operators are expected to perform complex tasks with potentially enormous
impact on the system without the safety of an undo mechanism should they
slip or make an incorrect decision. Furthermore, the lack of an undo
facility for system maintenance precludes trial-and-error
investigation, an effective process for diagnosis and learning.
We believe that ROC systems must provide an undo facility that covers all
aspects of system operation, from system configuration to application
management to software and hardware upgrades. The undo facility must provide
a way to "repair the past" as well as simply unwind time; we think
of undo as a three-step process of rewinding time, untangling
problems, then replaying the system back to the current time. Clearly there are limits to
what types of actions can be undone; the goals of our research are to
explore those limits, to identify the cost/benefit tradeoffs in choosing the
scope of undo, and to define an undo model that is practical and that
significantly improves the dependability of human-operated systems.
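The three-step undo model above can be sketched with a simple operation log; the names are illustrative, and real system state is far richer than a key-value map:

```python
# Hedged sketch of rewind / repair-the-past / replay over an operation
# log. A real undo facility must also handle external effects and state
# that cannot simply be re-derived.
class UndoableSystem:
    def __init__(self):
        self.state = {}
        self.log = []                # ordered record of every operation

    def apply(self, key, value):
        self.log.append((key, value))
        self.state[key] = value

    def rewind_repair_replay(self, bad_index):
        """Rewind to the initial state, drop the faulty operation
        ("repair the past"), then replay the rest to the present."""
        repaired = self.log[:bad_index] + self.log[bad_index + 1:]
        self.state, self.log = {}, []        # rewind
        for key, value in repaired:          # replay
            self.apply(key, value)

sys_ = UndoableSystem()
sys_.apply("dns", "10.0.0.1")
sys_.apply("dns", "0.0.0.0")     # operator slip
sys_.apply("mail", "relay1")
sys_.rewind_repair_replay(1)     # undo only the slip, keep later work
print(sys_.state)                # {'dns': '10.0.0.1', 'mail': 'relay1'}
```

Note that later operations survive the repair, which is what distinguishes this model from a plain rollback to a checkpoint.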
- Integrated Diagnostic Support. For the ROC approach to be
successful, recovery must be swift and efficient. It follows that a ROC
system must rapidly detect the presence of failures and identify their root
causes so that they may be quickly repaired or contained. Furthermore,
latent errors must be unmasked before they are allowed to build up and cause
catastrophic chain-reaction failures, a surprisingly common failure mode in
supposedly fault-tolerant systems.
We believe that these goals can be achieved by integrating diagnostic
support throughout the system in the form of self-testing and automated
root-cause analysis. All modules in a ROC system should be self-testing, and
should verify the behavior of all other modules that they depend upon. We
are strong proponents of online testing, in which test inputs and
even faulty inputs are purposefully inserted into running production systems
to verify their proper operation (see also the discussion below on
verification of recovery mechanisms). Besides being self-testing, the
components of a ROC system should cooperate to track dependencies between
modules, resources, and user requests, as these dependencies provide
valuable information for human diagnosticians and automated root-cause
analysis approaches. Our research in this area includes the design of
testing interfaces and frameworks for system components, software
verification approaches, and root-cause analysis algorithms.
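A toy version of dependency-driven root-cause analysis might look like the following: the blamed module is a failing module whose own dependencies all pass their self-tests. The module names and graph are hypothetical:

```python
# Illustrative sketch: modules declare dependencies; a self-test fails
# if the module or any transitive dependency is broken; root-cause
# analysis blames the deepest failing module in the dependency graph.
deps = {
    "frontend": ["app"],
    "app": ["db", "cache"],
    "db": [],
    "cache": [],
}

def self_test(module, broken):
    """A module's self-test fails if it, or anything it depends on
    transitively, is broken."""
    if module in broken:
        return False
    return all(self_test(d, broken) for d in deps[module])

def root_cause(broken):
    """Failing modules whose own dependencies all pass: the likely
    root causes rather than mere victims of propagation."""
    return [m for m in deps
            if not self_test(m, broken)
            and all(self_test(d, broken) for d in deps[m])]

print(root_cause({"db"}))   # ['db'] — frontend and app fail too,
                            # but only as victims of the db failure
```

The real research problem is harder: dependencies must be discovered rather than declared, and failures observed statistically rather than by perfect self-tests.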
- Online Verification of Recovery Mechanisms. By nature, a ROC
system relies on its recovery mechanisms to provide dependable,
highly-available service. Regardless of what those recovery mechanisms might
be (isolation and redundancy, undo, proactive restarts, and so on), it is
important that the mechanisms be reliable, effective, and efficient.
Practical experience and anecdotal evidence show that many real-world
failures are created or compounded by non-functional repair or
warning systems; such situations should not occur in ROC systems.
Thus, to avoid reliance on buggy, inefficient, or incomplete recovery
mechanisms, ROC systems should proactively test and verify the proper
behavior of their recovery mechanisms. Verification should consist of both
directed and random tests, with realistic system-level faults inserted as
perturbations. As with the self-testing used for diagnosis, verification of
recovery mechanisms must be performed online, even in production
systems. Research issues here include devising fault injection strategies
that properly exercise recovery mechanisms, developing measurable standards
for correct and efficient recovery, and integrating
recovery-mechanism-verification with isolation so that it can be safely
deployed in production environments.
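The idea of verifying a recovery mechanism by injecting faults can be sketched as follows; here the "recovery mechanism" is a simple retry-based failover, and the injected fault model is a stand-in for realistic system-level faults:

```python
# Hedged sketch: inject random faults into a service and verify that the
# recovery mechanism (retrying against another instance) still delivers
# every request. Fault probabilities and names are invented for the example.
import random

def make_flaky_service(fail_prob, rng):
    def service(request):
        if rng.random() < fail_prob:       # injected fault
            raise RuntimeError("injected fault")
        return f"ok:{request}"
    return service

def with_recovery(service, retries=5):
    def recovered(request):
        for _ in range(retries):
            try:
                return service(request)
            except RuntimeError:
                continue                    # recovery: try again
        return None                         # recovery mechanism exhausted
    return recovered

def verify_recovery(trials=1000, fail_prob=0.3, seed=0):
    """Fraction of requests served correctly despite injected faults."""
    rng = random.Random(seed)
    svc = with_recovery(make_flaky_service(fail_prob, rng))
    return sum(svc(i) == f"ok:{i}" for i in range(trials)) / trials

print(verify_recovery())   # fraction of requests served despite faults
```

A verification run like this, performed online against an isolated slice of a production system, is what would expose a buggy or incomplete recovery path before a real failure does.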
- Design for high modularity, measurability, and restartability. Problems
related to "software aging", such as memory arena corruption and
complicated, difficult-to-reproduce timing-related concurrency bugs
("Mandelbugs"), are often best resolved by a total or partial
restart of the affected components. In some cases, proactively
restarting components before they fail can improve overall
availability; most clustered Internet services already do this. We are
investigating what techniques can be used to (re)structure applications to
make them amenable to "design for restartability"; issues that
must be addressed include state management, detecting unexpected
interactions among ensembles of coordinated components, determining which
end-to-end and component-level checks should be used to reliably detect and
infer failures, and the extent to which statistical monitoring techniques
can be applied.
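Proactive restart can be illustrated with a toy supervisor that rejuvenates an "aging" worker before it reaches its failure threshold; the aging model and thresholds are invented for the example:

```python
# Illustrative sketch of proactive restart ("rejuvenation"): a worker
# accumulates simulated aging (e.g., a resource leak) per request, and
# the supervisor replaces it with a fresh worker before it can fail.
class Worker:
    FAIL_AT = 100                  # worker crashes past this much "leak"

    def __init__(self):
        self.leak = 0

    def handle(self, request):
        self.leak += 7             # simulated software aging per request
        if self.leak >= self.FAIL_AT:
            raise RuntimeError("worker failed from aging")
        return f"done:{request}"

def serve(requests, rejuvenate_at=None):
    """Serve requests, optionally restarting the worker proactively
    once its accumulated aging crosses rejuvenate_at."""
    worker, served = Worker(), 0
    for r in range(requests):
        if rejuvenate_at and worker.leak >= rejuvenate_at:
            worker = Worker()      # clean restart before failure
        worker.handle(r)
        served += 1
    return served

print(serve(1000, rejuvenate_at=80))   # all 1000 requests served
```

Without the proactive restart, the worker here fails after a handful of requests; with it, service continues indefinitely. This presumes the worker's state can be discarded or rebuilt on restart, which is exactly the state-management issue noted above.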
- Dependability/Availability Benchmarking. The goal of
Recovery-Oriented Computing is to improve system dependability. To evaluate
our progress in developing ROC systems and to compare the results with
existing systems, we must have benchmarks that provide a reproducible,
impartial measure of system dependability.
We are developing standard dependability benchmarks that use the injection
of system-level faults and perturbations to evaluate the impact of realistic
failures on delivered quality of service. Part of our research consists of
collecting data on faults and failure modes from real Internet service
environments; we intend to distill this collected data into a
publicly-available fault model for Internet services. We are also
investigating the definition of metrics for dependability, and considering
how best to incorporate human behavior into our dependability benchmarks.
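One way such a benchmark might report delivered quality of service is sketched below, with hand-picked fault windows standing in for the field-data-derived fault model described above:

```python
# Hedged sketch of a dependability metric: inject outage windows into a
# request timeline and report delivered quality of service as the
# fraction of requests answered. The fault windows are illustrative.
def benchmark(total_requests, fault_windows):
    """fault_windows: list of (start, end) request-index ranges during
    which the service under test drops requests."""
    def degraded(i):
        return any(start <= i < end for start, end in fault_windows)
    answered = sum(1 for i in range(total_requests) if not degraded(i))
    return answered / total_requests

# Two injected outages covering 150 of 1000 requests:
qos = benchmark(1000, [(100, 200), (600, 650)])
print(f"delivered QoS: {qos:.3f}")   # delivered QoS: 0.850
```

A real benchmark would measure latency and correctness degradation as well as outright drops, and would replay fault loads drawn from the collected field data rather than fixed windows.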
Contact: roc-group at cs.berkeley.edu.
Last modified on 03-Nov-2004 21:54:22 -0800