Embracing Failure: Availability via Recovery-Oriented Computing (ROC) 
Aaron Brown, UC Berkeley
Abstract: Motivated by the lack of rapid improvement in the availability of Internet server systems, we argue for a new approach to building highly available systems that better reflects the realities of the modern server environment, namely that failures of hardware, software, and humans are inevitable. Our approach, denoted Recovery-Oriented Computing (ROC), recognizes the inevitability of unanticipated failure and thus emphasizes recovery and repair rather than simple fault tolerance. Our proposal is unique in that it tackles the human aspects of availability along with the traditional system aspects. We propose a set of design principles for building ROC systems, identify the target for our first implementation of ROC, and describe how we intend to quantitatively evaluate the availability gains achieved by recovery-oriented computer design.