The UC Berkeley/Stanford Recovery-Oriented Computing (ROC) Project

The Berkeley/Stanford
Recovery-Oriented Computing (ROC)
Project Winter 2002 Retreat

Slides
- ROC Introduction [pdf] - David Patterson
- Undo and the 3 R's [pdf] - Aaron Brown
- Dependability of Large-scale Internet Services - David Oppenheimer
- Failure Analysis of the PSTN [pdf] - Patty Enriquez
- OceanStore Intro - John Kubiatowicz
- Introspective Replica Management in OceanStore - Dennis Geels
- Accessing the Web Through OceanStore - Patrick Eaton
- Dynamic Multicast Tree Construction for the Second Tier - Puneet Mehra
- Access Control in OceanStore
- Introspection for Low-level Data Prefetching - Mark Whitney
- Tapestry Architecture and Status - Ben Zhao
- Dynamic Deletion Algorithms for Tapestry - Kris Hildrum
- Attenuated Bloom Filters for Routing - Sean Rhea
- Archival Management in OceanStore - Hakim Weatherspoon
- A Shared-disk Parallel Database Toolkit - Noah Treuhaft
- Minimizing Time to Recover in a Recursively Restartable System - George Candea
- Pinpoint: Automating Root Cause Analysis in J2EE - Mike Chen
- FIG: Fault Injection in Glibc - Naveen Sastry
Posters
- FIG: Fault Injection in Glibc
  Abstract: We believe there is a need for enhanced software tools that can test the reliability and recoverability of applications under system environment failures. We developed a lightweight, extensible software testing package that intercepts calls from applications to the operating system and injects errors to simulate system faults. We then used this tool to test the behavior of common UNIX applications under various failure scenarios.
- Dependability of Large-scale Internet Services
- Undo for Recovery: Approaches and Models
  Abstract: Motivated by the observation that human error is a major source of failures in large server systems, we introduce the notion of Undo as a mechanism to provide recovery from human-induced system failures. We define the "Three R's" (Rewind, Repair, and Replay), which constitute an undo paradigm based on the combination of time travel and repair; the 3R model is well-matched to the recovery demands of human-error-induced failures in server systems. We identify some of the challenges in creating a practical implementation of a 3R-Undo-based system, and finally present initial thoughts on system state models for undo-capable systems.
- E-mail Dependability Benchmarking
  Abstract: E-mail was originally conceived as a "best effort" service with little effort directed towards attaining 100% dependability. Despite these humble beginnings, e-mail has now become a mission-critical service. Spurred on by its increasing ubiquity and increased expectations for reliability, 100% dependability is no longer just a desirable feature but rather a virtual requirement in modern-day systems. Given that e-mail was never designed with such a goal in mind, we seek to define metrics of dependability and create a dependability benchmark to allow comparison of e-mail dependability between different e-mail platforms. We present the motivations behind creating such a benchmark, potential dependability metrics, an idea on the structure of the benchmark, and a brief list of open questions and issues.
- Hamming Transcoder for Power Reduction on Internal Buses
  Abstract: In modern chip design, power has become a dominant concern. At the same time, scaling trends have increased the importance of wires relative to logic. This suggests that one might use more sophisticated bus driver technology to reduce the power consumed in transporting information across chips. In this paper, we investigate the possibility of reducing the power consumed on internal buses by using a fixed-length coding scheme and data prediction techniques. Our design is based on the premise that the information transmitted through such internal buses is compressible. We first explore a number of high-level algorithms for compressing the number of transitions on buses, then explore the design of a practical transcoder. Simulations using a modified SimpleScalar simulator and SPEC95 benchmarks show an average of 46% savings in transitions on internal buses such as the reorder buffer and register file. To quantify actual power savings, we design a simple encoder/decoder circuit in a 0.18 micron process, extract it as a netlist, then simulate its behavior under SPICE.
- The OceanStore Introspection Layer
  Abstract: OIL describes the motivation for introspection in OceanStore and introduces a framework for reusable introspective components.
- Pinpoint: Problem Determination in Large, Dynamic Internet Services
  Abstract: Traditional problem determination techniques rely on static dependency models that are difficult to generate accurately in today's large, distributed, and dynamic application environments such as e-commerce systems. In this paper, we present a dynamic analysis methodology that automates problem determination in these environments by 1) coarse-grained tagging of numerous real client requests as they travel through the system and 2) using data mining techniques to correlate the believed failures and successes of these requests to determine which component(s) are most likely to be at fault. To validate our methodology, we have implemented Pinpoint, a framework for root-cause analysis on the J2EE platform that requires no knowledge of the application components. Pinpoint consists of three parts: a communications layer that traces client requests, a failure detector that uses traffic-sniffing and middleware instrumentation, and a data analysis engine. We evaluate Pinpoint by injecting faults into various application components and show that Pinpoint identifies the faulty components with high accuracy and produces few false positives.
- End-User Web Availability

Last Updated: 02/12/2004 09:21
