A Recovery-Oriented Approach to Dependable Services: Repairing Past Errors With System-Wide Undo [pdf]
Aaron B. Brown
UC Berkeley Computer Science Division Technical Report UCB//CSD-04-1304, December 2003.

Abstract

Motivated by the pressing need for increased dependability in corporate and Internet services and by the perspective that effective recovery can improve dependability as much or more than avoiding failures, we introduce a novel recovery mechanism that gives human system operators the power of system-wide undo. System-wide undo allows operators to roll back erroneous changes to a service.s state without losing end-user data or updates, to make retroactive repairs in the historical timeline of the service system, and thereby to quickly recover from catastrophic state corruption, operator error, failed upgrades, and external attacks, even when the root cause of the catastrophe is unknown.

We explore system-wide undo via a framework based on the novel concept of spheres of undo, bubbles of state and time that provide scope to the state recoverable by undo and serve as a structuring tool for implementing undo on standalone services, hierarchically-composed systems, and distributed interacting services. Crucially, spheres of undo allow us to define the concept of paradoxes, inconsistencies that occur when an undo process retroactively alters state that has been exposed outside of its containing sphere of undo. Managing paradoxes is the grand challenge of system-wide undo, and to tackle it we introduce a framework that automatically detects and compensates for paradoxes; our approach exploits the relaxed consistency semantics already present in existing services that interact with human end-users, and can provide flexible recovery guarantees ranging from local compensation to lose-no-work to intent-soundness.

We describe an implementation of our system-wide undo framework for standalone services with human end-users. We explore its applicability by assembling and evaluating a prototype undoable e-mail store service, by analyzing what would be necessary to construct an undoable online auction service, and by developing a set of guidelines to help service designers retrofit their services with undo. We find that system-wide undo functionality imposes non-negligible but tolerable overhead in terms of both time and space. Using a novel methodology we develop to benchmark human-assisted recovery processes, we also find that undo-based recovery has a net positive effect on dependability, providing significant improvements in correctness while only slightly degrading availability.