The Recovery-Oriented Computing (ROC) project is a joint Berkeley/Stanford
research project that is investigating novel techniques for building highly-dependable
Internet services. In a significant divergence from traditional
fault-tolerance approaches, ROC emphasizes recovery from failures rather
than failure-avoidance. This philosophy is motivated by the observation that
even the most robust systems still occasionally encounter failures due to human
operator error, transient or permanent hardware failure, and software anomalies
resulting from "Heisenbugs" or software aging.
The ROC approach takes the following three assumptions as its basic tenets:
- failure rates of both software and hardware are non-negligible
- systems cannot be completely modeled for reliability analysis, and thus
their failure modes cannot be predicted in advance
- human error by system operators and during system maintenance is a major
source of system failures
These assumptions, while running counter to most existing work in dependable
and fault-tolerant systems, are all strongly supported by field evidence from
modern production Internet service environments.
ROC Research Areas
The assumptions listed above provide a broad philosophy for guiding the
design of ROC systems. From this philosophy, we have identified several more
concrete research areas that fall under the ROC umbrella. Each of these areas
defines one of the important qualities that must be provided by a truly
recovery-oriented computing system.
- Isolation and Redundancy. A key part of any recovery-oriented
computing system is the ability to isolate portions of the system. Isolation
is crucial for fault containment and safe online recovery, and it is an
enabler for many of the diagnostic and verification techniques described
below. Isolation naturally demands redundancy, as redundancy allows
continued service delivery while portions of the system are isolated.
Isolation and redundancy in ROC systems must go beyond traditional
approaches. Because the ROC philosophy assumes that obscure and unexpected
faults may occur, isolation must be failure-proof under a broad failure
model, including all software and human-induced failures.
We are investigating hardware support for isolation as well as robust
software isolation using virtual machine monitors, software sandboxing,
and hardware-assisted sandboxing (e.g., using MMU hardware).
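As a minimal illustration of fault containment through isolation plus redundancy (an assumed sketch, not an ROC implementation), the following runs a component in a separate OS process so that a crash is contained, then falls back to a redundant replica:

```python
# Hedged sketch: isolate a component in its own interpreter process so a
# crash cannot take down the caller, and serve from a redundant replica
# when the isolated component fails. All names here are illustrative.
import subprocess
import sys

def run_isolated(primary_code, replica_code):
    """Run primary_code in a separate process; on any crash, fall back
    to replica_code (redundancy enables continued service)."""
    for snippet in (primary_code, replica_code):
        result = subprocess.run([sys.executable, "-c", snippet],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()
    return None                      # both the component and its replica failed

primary = "import os; os.abort()"    # simulated component crash
replica = "print(21 * 2)"            # redundant replica serves the request
print(run_isolated(primary, replica))   # → 42
```

Process boundaries are only one point in the design space sketched above; virtual machine monitors and MMU-assisted sandboxing provide stronger containment at higher cost.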
- System-wide support for Undo. Data and experience show that human
error is the largest single cause of failures and outages in modern server
systems, and results from psychology research point out that human error is
intrinsic and can never be completely eliminated. Most productivity
applications recognize the inevitability of human error and provide undo facilities
that allow the human user to recover from their errors. However, such
facilities are rarely if ever provided for system maintenance: system
operators are expected to perform complex tasks with potentially enormous
impact on the system without the safety of an undo mechanism should they
slip or make an incorrect decision. Furthermore, the lack of an undo
facility for system maintenance precludes trial-and-error
investigation, an effective process for diagnosis and learning.
We believe that ROC systems must provide an undo facility that covers all
aspects of system operation, from system configuration to application
management to software and hardware upgrades. The undo facility must provide
a way to "repair the past" as well as simply unwind time; we think
of undo as a three-step process of rewinding time, untangling
problems, then replaying the system back to the current time. Clearly there are limits to
what types of actions can be undone; the goals of our research are to
explore those limits, to identify the cost/benefit tradeoffs in choosing the
scope of undo, and to define an undo model that is practical and that
significantly improves the dependability of human-operated systems.
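The three-step undo model above can be sketched with a simple operation log; the names are illustrative, and real system state is far richer than a key-value map:

```python
# Hedged sketch of rewind / repair-the-past / replay over an operation
# log. A real undo facility must also handle external effects and state
# that cannot simply be re-derived.
class UndoableSystem:
    def __init__(self):
        self.state = {}
        self.log = []                # ordered record of every operation

    def apply(self, key, value):
        self.log.append((key, value))
        self.state[key] = value

    def rewind_repair_replay(self, bad_index):
        """Rewind to the initial state, drop the faulty operation
        ("repair the past"), then replay the rest to the present."""
        repaired = self.log[:bad_index] + self.log[bad_index + 1:]
        self.state, self.log = {}, []        # rewind
        for key, value in repaired:          # replay
            self.apply(key, value)

sys_ = UndoableSystem()
sys_.apply("dns", "10.0.0.1")
sys_.apply("dns", "0.0.0.0")     # operator slip
sys_.apply("mail", "relay1")
sys_.rewind_repair_replay(1)     # undo only the slip, keep later work
print(sys_.state)                # {'dns': '10.0.0.1', 'mail': 'relay1'}
```

Note that later operations survive the repair, which is what distinguishes this model from a plain rollback to a checkpoint.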
- Integrated Diagnostic Support. For the ROC approach to be
successful, recovery must be swift and efficient. It follows that a ROC
system must rapidly detect the presence of failures and identify their root
causes so that they may be quickly repaired or contained. Furthermore,
latent errors must be unmasked before they are allowed to build up and cause
catastrophic chain-reaction failures, a surprisingly common failure mode in
supposedly fault-tolerant systems.
We believe that these goals can be achieved by integrating diagnostic
support throughout the system in the form of self-testing and automated
root-cause analysis. All modules in a ROC system should be self-testing, and
should verify the behavior of all other modules that they depend upon. We
are strong proponents of online testing, in which test inputs and
even faulty inputs are purposefully inserted into running production systems
to verify their proper operation (see also the discussion below on
verification of recovery mechanisms). Besides being self-testing, the
components of a ROC system should cooperate to track dependencies between
modules, resources, and user requests, as these dependencies provide
valuable information for human diagnosticians and automated root-cause
analysis approaches. Our research in this area includes the design of
testing interfaces and frameworks for system components, software
verification approaches, and root-cause analysis algorithms.
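A toy version of dependency-driven root-cause analysis might look like the following: the blamed module is a failing module whose own dependencies all pass their self-tests. The module names and graph are hypothetical:

```python
# Illustrative sketch: modules declare dependencies; a self-test fails
# if the module or any transitive dependency is broken; root-cause
# analysis blames the deepest failing module in the dependency graph.
deps = {
    "frontend": ["app"],
    "app": ["db", "cache"],
    "db": [],
    "cache": [],
}

def self_test(module, broken):
    """A module's self-test fails if it, or anything it depends on
    transitively, is broken."""
    if module in broken:
        return False
    return all(self_test(d, broken) for d in deps[module])

def root_cause(broken):
    """Failing modules whose own dependencies all pass: the likely
    root causes rather than mere victims of propagation."""
    return [m for m in deps
            if not self_test(m, broken)
            and all(self_test(d, broken) for d in deps[m])]

print(root_cause({"db"}))   # ['db'] — frontend and app fail too,
                            # but only as victims of the db failure
```

The real research problem is harder: dependencies must be discovered rather than declared, and failures observed statistically rather than by perfect self-tests.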
- Online Verification of Recovery Mechanisms. By nature, a ROC
system relies on its recovery mechanisms to provide dependable,
highly-available service. Regardless of what those recovery mechanisms might
be (isolation and redundancy, undo, proactive restarts, and so on), it is
important that the mechanisms be reliable, effective, and efficient.
Practical experience and anecdotal evidence show that many real-world
failures are created or compounded by non-functional repair or
warning systems; such situations should not occur in ROC systems.
Thus, to avoid reliance on buggy, inefficient, or incomplete recovery
mechanisms, ROC systems should proactively test and verify the proper
behavior of their recovery mechanisms. Verification should consist of both
directed and random tests, with realistic system-level faults inserted as
perturbations. As with the self-testing used for diagnosis, verification of
recovery mechanisms must be performed online, even in production
systems. Research issues here include devising fault injection strategies
that properly exercise recovery mechanisms, developing measurable standards
for correct and efficient recovery, and integrating
recovery-mechanism-verification with isolation so that it can be safely
deployed in production environments.
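The idea of verifying a recovery mechanism by injecting faults can be sketched as follows; here the "recovery mechanism" is a simple retry-based failover, and the injected fault model is a stand-in for realistic system-level faults:

```python
# Hedged sketch: inject random faults into a service and verify that the
# recovery mechanism (retrying against another instance) still delivers
# every request. Fault probabilities and names are invented for the example.
import random

def make_flaky_service(fail_prob, rng):
    def service(request):
        if rng.random() < fail_prob:       # injected fault
            raise RuntimeError("injected fault")
        return f"ok:{request}"
    return service

def with_recovery(service, retries=5):
    def recovered(request):
        for _ in range(retries):
            try:
                return service(request)
            except RuntimeError:
                continue                    # recovery: try again
        return None                         # recovery mechanism exhausted
    return recovered

def verify_recovery(trials=1000, fail_prob=0.3, seed=0):
    """Fraction of requests served correctly despite injected faults."""
    rng = random.Random(seed)
    svc = with_recovery(make_flaky_service(fail_prob, rng))
    return sum(svc(i) == f"ok:{i}" for i in range(trials)) / trials

print(verify_recovery())   # fraction of requests served despite faults
```

A verification run like this, performed online against an isolated slice of a production system, is what would expose a buggy or incomplete recovery path before a real failure does.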
- Design for high modularity, measurability, and restartability. Problems
related to "software aging", such as memory arena corruption and
complicated, difficult-to-reproduce timing-related concurrency bugs
("Mandelbugs"), are often best resolved by a total or partial
restart of the affected components. In some cases, proactively
restarting components before they fail can improve overall
availability; most clustered Internet services already do this. We are
investigating what techniques can be used to (re)structure applications to
make them amenable to "design for restartability"; issues that
must be addressed include state management, detecting unexpected
interactions among ensembles of coordinated components, determining which
end-to-end and component-level checks should be used to reliably detect and
infer failures, and the extent to which statistical monitoring techniques
can be applied.
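Proactive restart can be illustrated with a toy supervisor that rejuvenates an "aging" worker before it reaches its failure threshold; the aging model and thresholds are invented for the example:

```python
# Illustrative sketch of proactive restart ("rejuvenation"): a worker
# accumulates simulated aging (e.g., a resource leak) per request, and
# the supervisor replaces it with a fresh worker before it can fail.
class Worker:
    FAIL_AT = 100                  # worker crashes past this much "leak"

    def __init__(self):
        self.leak = 0

    def handle(self, request):
        self.leak += 7             # simulated software aging per request
        if self.leak >= self.FAIL_AT:
            raise RuntimeError("worker failed from aging")
        return f"done:{request}"

def serve(requests, rejuvenate_at=None):
    """Serve requests, optionally restarting the worker proactively
    once its accumulated aging crosses rejuvenate_at."""
    worker, served = Worker(), 0
    for r in range(requests):
        if rejuvenate_at and worker.leak >= rejuvenate_at:
            worker = Worker()      # clean restart before failure
        worker.handle(r)
        served += 1
    return served

print(serve(1000, rejuvenate_at=80))   # all 1000 requests served
```

Without the proactive restart, the worker here fails after a handful of requests; with it, service continues indefinitely. This presumes the worker's state can be discarded or rebuilt on restart, which is exactly the state-management issue noted above.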
- Dependability/Availability Benchmarking. The goal of
Recovery-Oriented Computing is to improve system dependability. To evaluate
our progress in developing ROC systems and to compare the results with
existing systems, we must have benchmarks that provide a reproducible,
impartial measure of system dependability.
We are developing standard dependability benchmarks that use the injection
of system-level faults and perturbations to evaluate the impact of realistic
failures on delivered quality of service. Part of our research consists of
collecting data on faults and failure modes from real Internet service
environments; we intend to distill this collected data into a
publicly-available fault model for Internet services. We are also
investigating the definition of metrics for dependability, and considering
how best to incorporate human behavior into our dependability benchmarks.
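One way such a benchmark might report delivered quality of service is sketched below, with hand-picked fault windows standing in for the field-data-derived fault model described above:

```python
# Hedged sketch of a dependability metric: inject outage windows into a
# request timeline and report delivered quality of service as the
# fraction of requests answered. The fault windows are illustrative.
def benchmark(total_requests, fault_windows):
    """fault_windows: list of (start, end) request-index ranges during
    which the service under test drops requests."""
    def degraded(i):
        return any(start <= i < end for start, end in fault_windows)
    answered = sum(1 for i in range(total_requests) if not degraded(i))
    return answered / total_requests

# Two injected outages covering 150 of 1000 requests:
qos = benchmark(1000, [(100, 200), (600, 650)])
print(f"delivered QoS: {qos:.3f}")   # delivered QoS: 0.850
```

A real benchmark would measure latency and correctness degradation as well as outright drops, and would replay fault loads drawn from the collected field data rather than fixed windows.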
Contact: roc-group at cs.berkeley.edu.
Last modified on 03-Nov-2004 21:54:22 -0800