After 15 years of successfully improving cost-performance, it's time
for new challenges for the systems research community.
As a result of the focus on cost-performance, the fabled five 9s of
availability look to be much easier to achieve on billboards than
in computers, and managing systems with state can cost ten times as
much as the equipment. In a PostPC Era of wireless gadgets using
services on the Internet, one new challenge is building services that
really are dependable and much less expensive to maintain.
Traditional Fault Tolerant Computing concentrates on tolerating hardware
and operating system faults, ignoring faults caused by human operators
and even by applications. Recovery Oriented Computing (ROC) aims at
improving Mean Time To Recover, both to lower the cost of management
and to improve the availability of the whole system, including the
people who operate it. We look to civil engineering and studies of
disasters to inspire principles for ROC design.
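
To make the link between recovery time and availability concrete, here
is a minimal sketch that assumes the standard steady-state approximation
availability = MTTF / (MTTF + MTTR). The function names and the failure
and repair figures are hypothetical, chosen only for illustration; they
are not measurements from this work.

```python
# Illustrative sketch only: the MTTF/MTTR figures below are assumed values.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(mttf_hours, mttr_hours):
    """Steady-state availability from mean time to failure and to recover."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Same assumed failure rate, two assumed recovery times.
    slow = availability(mttf_hours=1000.0, mttr_hours=1.0)
    fast = availability(mttf_hours=1000.0, mttr_hours=0.1)
    print(f"MTTR 1 h:   {downtime_minutes_per_year(slow):.1f} min/year down")
    print(f"MTTR 0.1 h: {downtime_minutes_per_year(fast):.1f} min/year down")
    print(f"five 9s:    {downtime_minutes_per_year(0.99999):.2f} min/year down")
```

Under this approximation, five 9s allows roughly 5.26 minutes of downtime
per year, and when MTTR is small relative to MTTF, cutting MTTR by a
factor of ten cuts expected downtime by about the same factor without
any change in the failure rate.
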
This talk outlines our tentative research agenda and proposed principles
of ROC design, plus some concrete results in the area of availability
benchmarking and a first-pass design of the hardware of a server built
along these lines, called ROC-I.