Summary of Breakout Session, 9/27/01

Group 1: Undo

Lots of questions...

  • What is Undo good for? Are we trying to recover from software bugs or from human (e.g., admin) errors? Related to this, we might want to keep track of human-generated vs. computer-generated data.
     
  • We need to draw a boundary around the system (or the parts of the system) where we are attempting to undo. Anything outside of this scope, we won't attempt to undo.
     
  • But do we want to attempt compensating actions for things outside of our immediate undo control?
     
  • We should come up with a way to (at least) present information relevant to the undo action to the user (such as dependencies and history information). E.g., undoing this will remove some 500 files that were created; we could then ask the user what they want to do with them (a sketch of this follows the list).
     
  • Where is Undo applicable? Workstations, Internet service clusters?
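
A minimal sketch of the "report, then ask" flow described above, not something proposed in the session; the Operation record and undo_report helper are invented for illustration:

    from dataclasses import dataclass, field

    @dataclass
    class Operation:
        op_id: int
        description: str
        created_files: list[str] = field(default_factory=list)
        depends_on: list[int] = field(default_factory=list)    # ids of earlier operations

    def undo_report(target: Operation, log: list[Operation]) -> str:
        """Summarize what undoing `target` would affect, so the user can decide."""
        dependents = [op for op in log if target.op_id in op.depends_on]
        lines = [f"Undoing: {target.description}"]
        if target.created_files:
            lines.append(f"  will remove {len(target.created_files)} file(s) created by it")
        for op in dependents:
            lines.append(f"  later operation depends on it: {op.description}")
        return "\n".join(lines)

    # The user would then choose to proceed, keep the created files, or cancel.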

Group 2: Virtualization and restartability

Checkpointing

  • Different levels
    • At higher levels it is hard to capture state – no good abstractions
    • But if you have such abstractions, it is easier to roll back to an app-level checkpoint (a sketch follows this list)
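
A minimal sketch of an app-level checkpoint, assuming the application exposes its state through a small, explicit abstraction (the Editor class below is invented for illustration):

    import copy

    class Editor:
        """Toy app whose relevant state is one explicit structure."""
        def __init__(self):
            self.buffer = []                  # the app-level abstraction of its state

        def checkpoint(self):
            # Capture only the app-level state, not process/OS details.
            return copy.deepcopy(self.buffer)

        def rollback(self, snapshot):
            # Rolling back is just restoring the named state.
            self.buffer = copy.deepcopy(snapshot)

    ed = Editor()
    ed.buffer.append("line 1")
    snap = ed.checkpoint()
    ed.buffer.append("bad edit")
    ed.rollback(snap)                         # buffer is back to ["line 1"]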

VMs

  • Easier to verify
  • App must go through approved interfaces
    • Verify coverage of interfaces
    • Would like clean OS (and other) interfaces
  • Treat the OS as a large, poorly debugged library that everyone shares

Modular restart

  • How to deal with persistent state
  • The Yahoo group maintains an interface for persistence
    • Write stateless code and use this interface for persistence (a sketch follows this list)
    • Are there other kinds of app-level APIs that we don't currently have?
  • What do you need for a clean restart?
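
A minimal sketch of the stateless-code-plus-persistence-interface idea, not Yahoo's actual API; PersistentStore and CounterService are invented names:

    from abc import ABC, abstractmethod
    from typing import Optional

    class PersistentStore(ABC):
        """Hypothetical narrow persistence interface; a restart only reconnects to it."""
        @abstractmethod
        def put(self, key: str, value: bytes) -> None: ...
        @abstractmethod
        def get(self, key: str) -> Optional[bytes]: ...

    class CounterService:
        """Stateless app logic: every request reads and writes through the store."""
        def __init__(self, store: PersistentStore):
            self.store = store                # the only thing to re-establish after a restart

        def increment(self, key: str) -> int:
            raw = self.store.get(key)
            count = int(raw) + 1 if raw is not None else 1
            self.store.put(key, str(count).encode())
            return count

Because the service keeps no state of its own, killing and restarting it loses nothing; a clean restart reduces to re-establishing the connection to the persistence interface.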

Group 3: Benchmarking and metrics

Problem: “unavailability” is not well defined

  • No longer binary -- what’s the objective function?
    • 2 min @ 9am != 2 min @ 1am
    • Premier users != basic users
    • A stock-trading transaction != a price quote
  • One definition for web services: probability of success (success per the SLA, typically a response-time bound); see the sketch after this list
    • e.g., the FCC's blocked-call measure: blocked calls / all calls
  • What about data quality (graceful degradation)?
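
A back-of-the-envelope sketch of a non-binary unavailability measure along these lines; the weights and field names are invented for illustration, not proposed values:

    def weighted_unavailability(requests, sla_latency_s=2.0):
        """requests: dicts with 'ok', 'latency_s', 'hour', 'premier', 'kind' fields."""
        lost, total = 0.0, 0.0
        for r in requests:
            w = 1.0
            w *= 3.0 if 9 <= r["hour"] < 17 else 1.0      # 2 min @ 9am != 2 min @ 1am
            w *= 2.0 if r["premier"] else 1.0             # premier users != basic users
            w *= 5.0 if r["kind"] == "trade" else 1.0     # a trade != a price quote
            total += w
            if not r["ok"] or r["latency_s"] > sla_latency_s:
                lost += w                                 # failed or missed the SLA
        return lost / total if total else 0.0             # cf. FCC: blocked calls / all calls

    # One blocked premier trade at 9am outweighs one slow off-peak basic quote.
    print(weighted_unavailability([
        {"ok": False, "latency_s": 0.0, "hour": 9,  "premier": True,  "kind": "trade"},
        {"ok": True,  "latency_s": 3.0, "hour": 1,  "premier": False, "kind": "quote"},
        {"ok": True,  "latency_s": 0.2, "hour": 13, "premier": False, "kind": "quote"},
    ]))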

How to make progress

  • Pick a service
    • Web (e-commerce, search engine), email, .NET, IM
    • Identify common faults (faulty workload) and failures (symptoms, for verification)
    • Define unavailability and data quality (see the sketch after this list)
      • e.g., 90% of the data is 50% as useful
  • Look at existing performance benchmarks
    • are they good availability benchmarks?
    • e.g. SPECWeb99
      • closed-loop -- request rate affected by failures/TCP interactions
      • time resolution on the order of minutes
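
A sketch of data quality as a utility curve, taking the "90% of the data is 50% as useful" example above as one point on an otherwise invented, piecewise-linear curve:

    def utility(fraction_of_data: float) -> float:
        """Usefulness of a degraded response given the fraction of data served."""
        points = [(0.0, 0.0), (0.5, 0.1), (0.9, 0.5), (1.0, 1.0)]   # (data served, usefulness)
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= fraction_of_data <= x1:
                return y0 + (y1 - y0) * (fraction_of_data - x0) / (x1 - x0)
        return 1.0

    # An availability benchmark could score a degraded-but-responding service by
    # utility(fraction served) instead of counting every response as all-or-nothing.
    print(utility(0.9))    # 0.5, per the example above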

Group 4: Diagnostics and on-line testing

Issues

  • Verbosity vs. Abstraction in diagnostic reports
    • How much to report?
    • How to compress/abstract information?
    • How to know what are the important bits?
    • The tension comes from the fact that we may not know, a priori, what we want to look for; as a consequence, everything needs to be observable
  • Should diagnostics ever be turned off?
    • No
    • We must test and ship the same system configuration: if we test with diagnostics, we must ship with diagnostics
    • With diagnostics always on, we can collect data on "normal" operation, so we can use statistical techniques to identify outliers or to classify normal/abnormal system states/behavior (see the sketch after this list)
  • Local vs. Global event correlation
    • There is only so much one can observe/learn from local analysis of a single module or system.
    • Far more can be learned from correlating events across modules or system installations. E.g., we are more likely to discover that disks fail in batches if we correlate failures across many installations than within a single installation.
  • Internal vs. External testing [of components]
    • Related to clear-box vs. black-box testing
    • Internal testing can aid fault containment and effect fail-stop behavior
    • External testing required for end-to-end guarantees
    • External control of diagnostics and self-testing is useful:
      • When we want a module to do expensive testing only when it is not loaded
      • For human operators diagnosing system problems
    • Interesting example of extreme black-box diagnostics: some IBM machines send a "symptom string" back home in the event of a failure. The symptom string is some slice of machine state (e.g., registers). A database of previous symptoms and solutions can aid diagnosis.
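
A minimal sketch of the always-on-diagnostics idea feeding a statistical check; the metric names, baseline data, and z-score threshold are invented for illustration:

    import statistics

    def find_outliers(baseline, current, z_threshold=3.0):
        """Flag metrics whose current value is far outside the 'normal' baseline."""
        outliers = []
        for metric, value in current.items():
            history = baseline.get(metric, [])
            if len(history) < 2:
                continue                                   # not enough "normal" data yet
            mean = statistics.mean(history)
            stdev = statistics.stdev(history) or 1e-9      # avoid division by zero
            if abs(value - mean) / stdev > z_threshold:
                outliers.append(metric)
        return outliers

    # Baseline gathered while diagnostics stay on during normal operation.
    normal = {"disk_errors_per_hour": [0, 1, 0, 2, 1],
              "req_latency_ms": [20, 22, 19, 21, 23]}
    print(find_outliers(normal, {"disk_errors_per_hour": 9, "req_latency_ms": 21}))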

Research Topics

  1. How can we create diagnostics that are cheap enough to leave in production systems?
  2. How much does the diagnostic interface of components need to be standardized if we want to have external control of diagnostic functions and correlate events across different modules?
  3. How well can we leverage the network effect of collecting diagnostics from multiple installations to compute higher-order statistics not possible with only a local view?

Modified: 12 May 2002