Recovery-Oriented Computing (ROC)
Project Winter 2002 Retreat
- FIG: Fault Injection in Glibc
Abstract: We believe there is a need for enhanced software tools that can test the reliability and recoverability of applications under system environment failures. We developed a lightweight, extensible software testing package that intercepts calls from applications to the operating system and injects errors to simulate system faults. We then used this tool to test the behavior of common UNIX applications under various failure scenarios.
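The interception idea can be illustrated with a minimal sketch. This is not the FIG implementation, which works at the boundary between applications and glibc; it is a Python analogue under stated assumptions. The names FaultInjector and read_with_retry are hypothetical, introduced only for illustration: a wrapper sits between the caller and the real function and raises a chosen error on selected calls, so the caller's recovery path can be exercised.

```python
import errno
import os


class FaultInjector:
    """Hypothetical wrapper: forwards calls to func, failing selected ones."""

    def __init__(self, func, fail_on_calls, err=errno.EIO):
        self.func = func
        self.fail_on_calls = set(fail_on_calls)  # 1-based call numbers to fail
        self.err = err
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            # Simulate the underlying call failing with the chosen errno.
            raise OSError(self.err, os.strerror(self.err))
        return self.func(*args, **kwargs)


def read_with_retry(reader, retries=3):
    """Hypothetical application code under test: retry on OSError."""
    for _ in range(retries):
        try:
            return reader()
        except OSError:
            continue
    return None


# Fail the first two calls, then let the third succeed.
flaky_read = FaultInjector(lambda: b"data", fail_on_calls=[1, 2])
result = read_with_retry(flaky_read)
```

Choosing which calls fail by call number keeps each test run deterministic, which makes a failure scenario easy to reproduce.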
- Dependability of Large-scale Internet Services
- Undo for Recovery: Approaches
Abstract: Motivated by the observation that human error is a major source of failures in large server systems, we introduce the notion of Undo as a mechanism to provide recovery from human-induced system failures. We define the "Three R's" (Rewind, Repair, and Replay), which constitute an undo paradigm based on the combination of time travel and repair; the 3R model is well-matched to the recovery demands of human-error-induced failures in server systems. We identify some of the challenges in creating a practical implementation of a 3R-Undo-based system, and finally present initial thoughts on system state models for undo-capable systems.
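The Rewind, Repair, Replay sequence can be sketched on a toy key-value store. This is a simplified illustration, not the system described in the talk: the class UndoableStore, its operation format, and the choice to model "repair" as dropping the erroneous operation from the log are all assumptions made for the example.

```python
import copy


class UndoableStore:
    """Hypothetical key-value store that logs operations since a snapshot."""

    def __init__(self):
        self.state = {}
        self.snapshot = {}  # state at the point we can rewind to
        self.log = []       # operations applied since the snapshot

    def apply(self, op):
        self.log.append(op)
        self._execute(op)

    def _execute(self, op):
        kind, key, value = op
        if kind == "set":
            self.state[key] = value
        elif kind == "delete":
            self.state.pop(key, None)

    def undo(self, is_erroneous):
        # Rewind: restore the state saved at the snapshot.
        self.state = copy.deepcopy(self.snapshot)
        # Repair: here, simply remove the operations judged to be mistakes.
        self.log = [op for op in self.log if not is_erroneous(op)]
        # Replay: re-apply the remaining operations in their original order.
        for op in self.log:
            self._execute(op)


store = UndoableStore()
store.apply(("set", "alice", "inbox-1"))
store.apply(("delete", "alice", None))   # the operator's mistake
store.apply(("set", "bob", "inbox-2"))   # legitimate later work
store.undo(lambda op: op[0] == "delete")
```

After the undo, the mistaken delete is gone while the later legitimate update is preserved, which is the property that distinguishes this paradigm from simply restoring a backup.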
- E-mail Dependability Benchmarking
Abstract: E-mail was originally conceived as a "best-effort"
service, with little effort directed towards attaining 100%
dependability. Despite these humble beginnings, e-mail has now become a
mission-critical service. Spurred on by its increasing ubiquity and
increased expectations for reliability, 100% dependability is no longer just
a desirable feature but rather a virtual requirement in modern-day systems.
Given that e-mail was never designed with such a goal in mind, we seek to define
metrics of dependability and create a dependability benchmark to allow
comparison of e-mail dependability between different e-mail platforms. We
present the motivations behind creating such a benchmark, potential
dependability metrics, an idea on the structure of the benchmark, and a
brief list of open questions and issues.
- Hamming Transcoder for Power Reduction on Internal Buses
Abstract: In modern chip design, power has become a dominant concern. At the same time, scaling trends have increased the importance of wires relative to logic. This suggests that one might use more sophisticated bus driver technology to reduce the power consumed in transporting information across chips. In this paper, we investigate the possibility of reducing the power consumed on internal buses by using a fixed-length coding scheme and data prediction techniques. Our design is based on the premise that the information transmitted through such internal buses is compressible. We first explore a number of high-level algorithms for reducing the number of transitions on buses, then explore the design of a practical transcoder. Simulations using a modified SimpleScalar simulator and SPEC95 benchmarks show an average of 46% savings in transitions on internal buses such as the reorder buffer and register file. To quantify actual power savings, we design a simple encoder/decoder circuit in a 0.18 micron process, extract it as a netlist, then simulate its behavior under SPICE.
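To make the metric concrete, the sketch below counts bus transitions as the Hamming distance between consecutive words and shows how a coding scheme can lower that count. The coding used here is classic bus-invert coding, swapped in purely as a familiar illustration; it is not the fixed-length, prediction-based scheme of the paper, and the function names are hypothetical.

```python
def transitions(words):
    """Total number of bit flips between consecutive words on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def bus_invert_encode(words, width):
    """Bus-invert coding: send each word plain or inverted, whichever flips
    fewer wires. Bit position `width` is the extra invert line."""
    mask = (1 << width) - 1
    prev = 0  # assumed initial bus state: all wires low
    encoded = []
    for w in words:
        plain = w & mask
        inverted = (~w & mask) | (1 << width)
        # Compare flips relative to the current bus state, invert line included.
        if bin(prev ^ inverted).count("1") < bin(prev ^ plain).count("1"):
            prev = inverted
        else:
            prev = plain
        encoded.append(prev)
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words by undoing the inversion where flagged."""
    mask = (1 << width) - 1
    return [(~e & mask) if (e >> width) & 1 else (e & mask) for e in encoded]


raw = [0x00, 0xFF, 0x00, 0xFF]          # worst case for an 8-bit bus
coded = bus_invert_encode(raw, width=8)
raw_flips = transitions(raw)
coded_flips = transitions(coded)
```

On this alternating pattern the encoded stream toggles only the invert line, so the transition count drops sharply at the cost of one extra wire.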
- The OceanStore Introspection Layer
Abstract: OIL describes the motivation for introspection in
OceanStore, and introduces a framework for reusable introspective components.
- Problem Determination in Large, Dynamic Internet Services
Abstract: Traditional problem determination techniques rely on static dependency
models that are difficult to generate accurately in today's large, distributed, and dynamic application environments such as
e-commerce systems. In this paper, we present a dynamic analysis methodology that automates problem determination in these environments by 1)
coarse-grained tagging of numerous real client requests as they
travel through the system and 2) using data mining techniques to correlate the believed failures and successes of these requests to
determine which component(s) are most likely to be at fault. To validate our methodology, we have implemented Pinpoint, a framework for root-cause analysis on the J2EE platform that requires no
knowledge of the application components. Pinpoint consists of three parts: a communications layer that traces
client requests, a failure detector that uses traffic-sniffing and middleware instrumentation, and a data analysis engine. We
evaluate Pinpoint by injecting faults into various application components and show that Pinpoint identifies the faulty components
with high accuracy and produces few false positives.
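The two steps of the methodology (tag each request with the components it touches, then correlate those tags with request outcomes) can be sketched as follows. This is a deliberately simplified stand-in for the data analysis engine: it ranks components by the fraction of their requests that failed and does not reproduce Pinpoint's actual data mining technique. The function name, the trace format, and the component names are hypothetical.

```python
from collections import defaultdict


def rank_components(traces):
    """Rank components by the fraction of requests touching them that failed.

    traces: iterable of (components_used, failed) pairs, one per request.
    """
    seen = defaultdict(int)         # requests that touched each component
    failed_with = defaultdict(int)  # of those, how many were believed failed
    for components, failed in traces:
        for c in set(components):
            seen[c] += 1
            if failed:
                failed_with[c] += 1
    scores = {c: failed_with[c] / seen[c] for c in seen}
    # Highest failure fraction first; ties broken by name for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical traces: each request is tagged with the components it used.
traces = [
    ({"web", "cart", "db"}, True),
    ({"web", "catalog", "db"}, False),
    ({"web", "cart"}, True),
    ({"web", "catalog"}, False),
]
ranking = rank_components(traces)
```

Here "cart" appears in every failed request and in no successful one, so it ranks first, while shared components such as "web" score lower because they also appear in successful requests.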
- End-User Web Availability
Last Updated: 02/12/2004 09:21