Summary of Breakout Session, 9/27/01
Group 1: Undo
Lots of questions...
- What is Undo good for? Are we trying to recover from software bugs
or from human (e.g., admin) errors? Related to this, we might want
to keep track of human-generated vs. computer-generated data.
- We need to draw a boundary around the system (or parts of the
system) within which we attempt to undo. Anything outside this
scope, we won't attempt to undo.
- But do we want to try compensating actions for things
outside of our immediate undo control?
- Should come up with a way to (at least) present information
relevant to the undo action to the user (such as dependencies and
history information). E.g., undoing this will remove some 500 files
that were created; ask the user what they want to do with them.
- Where is Undo applicable? Workstations, Internet service clusters?
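One way to picture the scope boundary, compensating actions, and the "show the user what undo will touch" idea from the bullets above is the sketch below. It is only an illustration: the `Action` and `UndoLog` names, their fields, and the example actions are assumptions made here, not something the group proposed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    """One recorded action. All names and fields are hypothetical."""
    description: str
    in_scope: bool                                    # inside the undo boundary?
    undo: Optional[Callable[[], None]] = None         # exact inverse, if in scope
    compensate: Optional[Callable[[], None]] = None   # best-effort fix-up otherwise
    dependents: List[str] = field(default_factory=list)  # e.g. files created since


class UndoLog:
    def __init__(self) -> None:
        self._actions: List[Action] = []

    def record(self, action: Action) -> None:
        self._actions.append(action)

    @staticmethod
    def _plan(a: Action) -> str:
        if a.in_scope and a.undo:
            return "undo"
        if a.compensate:
            return "compensate"
        return "skip (outside undo scope)"

    def preview(self) -> List[str]:
        """Describe, newest action first, what a full undo would do."""
        lines = []
        for a in reversed(self._actions):
            note = f" [affects {len(a.dependents)} dependent item(s)]" if a.dependents else ""
            lines.append(f"{self._plan(a)}: {a.description}{note}")
        return lines

    def undo_all(self) -> List[str]:
        """Roll back newest-first; return descriptions of actions we could not touch."""
        skipped = []
        while self._actions:
            a = self._actions.pop()
            plan = self._plan(a)
            if plan == "undo":
                a.undo()
            elif plan == "compensate":
                a.compensate()
            else:
                skipped.append(a.description)
        return skipped


# Example with made-up actions: one in scope, one compensable, one neither.
state = {"quota": 10}
outbox = ["welcome mail"]
log = UndoLog()
state["quota"] = 20
log.record(Action("raise quota to 20", True,
                  undo=lambda: state.update(quota=10),
                  dependents=["a.txt", "b.txt"]))
log.record(Action("send welcome mail", False,
                  compensate=lambda: outbox.append("retraction mail")))
log.record(Action("page the operator", False))

for line in log.preview():   # shown to the user before anything is rolled back
    print(line)
skipped = log.undo_all()
```

The preview step is where dependency and history information would be presented, so the user can decide what to do with, say, the files that an undo would remove.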
Group 2: Virtualization and restartability
Checkpointing
- Different levels
- At higher levels it is hard to capture state: no good abstractions
- But if you have them, it is easier to roll back to
an app-level checkpoint
VMs
- Easier to verify
- App must go through approved interfaces
- Verify coverage of interfaces
- Would like clean OS, etc. interfaces
- Treat the OS as a large, poorly debugged library that everyone shares
Modular restart
- How to deal with persistent state
- A group at Yahoo maintains an interface for persistence
- Write stateless code, use this interface for persistence
- Other kinds of APIs at app level that we don't currently have?
- What do you need for a clean restart?
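The "stateless code plus a persistence interface" pattern above can be sketched as follows. This is a toy under stated assumptions: `StateStore`, `FileStore`, and `HitCounter` are hypothetical names invented for illustration, and the JSON file stands in for whatever persistence layer a real service would use.

```python
import json
import os
import tempfile
from typing import Protocol


class StateStore(Protocol):
    """Hypothetical persistence interface: the only place state may live."""
    def get(self, key: str, default: int = 0) -> int: ...
    def put(self, key: str, value: int) -> None: ...


class FileStore:
    """Toy file-backed store; writes go to a temp file, then are renamed into place."""
    def __init__(self, path: str) -> None:
        self.path = path

    def _load(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def get(self, key: str, default: int = 0) -> int:
        return self._load().get(key, default)

    def put(self, key: str, value: int) -> None:
        data = self._load()
        data[key] = value
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(data, f)
        os.replace(tmp, self.path)   # swap in the new file in one step


class HitCounter:
    """Stateless module: keeps nothing in memory between calls."""
    def __init__(self, store: StateStore) -> None:
        self.store = store

    def hit(self) -> int:
        n = self.store.get("hits") + 1
        self.store.put("hits", n)
        return n


path = os.path.join(tempfile.mkdtemp(), "state.json")
counter = HitCounter(FileStore(path))
counter.hit()
counter.hit()
counter = HitCounter(FileStore(path))   # simulate a module restart
print(counter.hit())                    # prints 3: the count survived the restart
```

Because the module holds no state of its own, restarting it amounts to constructing a new object against the same store, which is one possible answer to "what do you need for a clean restart?".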
Group 3: Benchmarking and metrics
Problem: “unavailability” is not well defined
- No longer binary -- what’s the objective function?
- 2 min @ 9am != 2 min @ 1am
- Premier users != basic users
- A stock-trading transaction != a price quote
- One definition for web services: probability of success (SLA, typically response time)
- e.g. FCC: blocked calls => blocked calls/all calls
- What about data quality (graceful degradation)?
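To make the contrast between a binary, FCC-style ratio and a non-binary objective function concrete, here is a small sketch. The weighting factors (business hours, premier users, trades vs. quotes) and their values are assumptions picked purely for illustration; the notes do not propose specific weights.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    ok: bool        # did the request succeed (e.g. meet its SLA)?
    hour: int       # hour of day, 0-23
    premier: bool   # premier vs. basic user
    kind: str       # "trade" or "quote"


def blocked_fraction(requests: List[Request]) -> float:
    """FCC-style metric: failed requests / all requests, every request equal."""
    if not requests:
        return 0.0
    return sum(not r.ok for r in requests) / len(requests)


def weight(r: Request) -> float:
    """Assumed importance of a request; all multipliers are made up."""
    w = 1.0
    if 9 <= r.hour < 17:
        w *= 4.0    # business hours matter more than 1am
    if r.premier:
        w *= 2.0    # premier users matter more than basic users
    if r.kind == "trade":
        w *= 5.0    # a trade matters more than a price quote
    return w


def weighted_unavailability(requests: List[Request]) -> float:
    """Failed weight / total weight."""
    total = sum(weight(r) for r in requests)
    if total == 0:
        return 0.0
    return sum(weight(r) for r in requests if not r.ok) / total


reqs = [
    Request(ok=False, hour=9, premier=True, kind="trade"),
    Request(ok=True, hour=1, premier=False, kind="quote"),
    Request(ok=True, hour=1, premier=False, kind="quote"),
    Request(ok=True, hour=1, premier=False, kind="quote"),
]
print(blocked_fraction(reqs))           # 0.25
print(weighted_unavailability(reqs))    # 40/43, roughly 0.93
```

The same single failure looks like 25% unavailability under the plain ratio and about 93% once importance is weighted in, which is the gap the "objective function" question is pointing at.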
How to make progress
- Pick a service
- Web (e-commerce, search engine), email, .NET, IM
- Identify common faults (faulty workload) and failures (symptoms, for verification)
- define unavailability and data quality
- e.g. 90% of the data is 50% as useful
- Look at existing performance benchmarks
- are they good availability benchmarks?
- e.g. SPECWeb99
- closed-loop -- request rate affected by failures/TCP interactions
- time resolution on the order of minutes
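The "90% of the data is 50% as useful" example suggests a utility curve over data completeness. The sketch below assumes a piecewise-linear curve through that one data point; the curve shape, the `usefulness` and `service_score` names, and the sample values are illustrative assumptions, not a proposed benchmark definition.

```python
from bisect import bisect_right
from typing import List, Tuple

# Assumed utility curve: (fraction of data returned, usefulness).
# Only the middle point comes from the notes; the endpoints are assumptions.
CURVE = [(0.0, 0.0), (0.9, 0.5), (1.0, 1.0)]


def usefulness(fraction: float, curve: List[Tuple[float, float]] = CURVE) -> float:
    """Linearly interpolate usefulness for a given fraction of data returned."""
    fraction = min(max(fraction, 0.0), 1.0)
    xs = [x for x, _ in curve]
    i = bisect_right(xs, fraction)
    if i == len(xs):                      # fraction equals the last x value
        return curve[-1][1]
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (fraction - x0) / (x1 - x0)


def service_score(samples: List[Tuple[bool, float]]) -> float:
    """Mean usefulness over (succeeded, fraction_of_data) samples; failures count as 0."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(usefulness(f) if ok else 0.0 for ok, f in samples) / len(samples)


samples = [(True, 1.0), (True, 0.9), (False, 0.0), (True, 1.0)]
print(service_score(samples))   # 0.625
```

A benchmark could report this score per measurement interval, which combines the availability and data-quality definitions into one number per interval.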
Group 4: Diagnostics and on-line testing
Issues
- Verbosity vs. Abstraction in diagnostic reports
- How much to report?
- How to compress/abstract information?
- How to know what are the important bits?
- Tension comes from the fact that we may not know, a priori, what we
want to look for. As a consequence, everything needs to be observable.
- Should diagnostics ever be turned off?
- No
- Must test and ship the same system configuration. If we test with
diagnostics, we must ship with diagnostics.
- With diagnostics always on, we can collect data on "normal"
operation and then use statistical techniques to identify outliers or to
classify system states/behavior as normal or abnormal.
- Local vs. Global event correlation
- Only so much one can observe/learn from local analysis of a single module or system.
- Far more can be learned from correlating events across modules or system installations. E.g., we are more likely to discover that disks
fail in batches if we correlate failures across many installations than
if we look only within single installations.
- Internal vs. External testing [of components]
- Related to clear-box vs. black-box testing
- Internal testing can aid fault containment and effect fail-stop behavior
- External testing required for end-to-end guarantees
- External control of diagnostics and self-testing is useful:
- When we want a module to do expensive testing only when not loaded
- For human operators diagnosing system problems
- Interesting example of extreme black-box diagnostics: some IBM machines send a "symptom string" back home in the event of a
failure. The symptom string is some slice of machine state (e.g. registers). A
database of previous symptoms and solutions can aid diagnosis.
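The symptom-string idea above can be sketched as a lookup against a database of past symptoms, falling back to the closest known symptom when there is no exact match. Nothing here reflects how the IBM machines actually encode or match symptom strings: the register-dump encoding, the `diagnose` function, the similarity cutoff, and the database entry are all assumptions for illustration.

```python
import difflib
from typing import Dict, Optional, Tuple


def symptom_string(registers: Dict[str, int]) -> str:
    """Encode a slice of machine state as a canonical string (assumed format)."""
    return ";".join(f"{name}={registers[name]:08x}" for name in sorted(registers))


def diagnose(symptom: str, known: Dict[str, str],
             cutoff: float = 0.8) -> Tuple[str, Optional[str]]:
    """Return (match kind, suggested fix) from a database of past symptoms."""
    if symptom in known:
        return ("exact", known[symptom])
    # Fall back to the most similar previously seen symptom, if close enough.
    close = difflib.get_close_matches(symptom, list(known), n=1, cutoff=cutoff)
    if close:
        return ("similar", known[close[0]])
    return ("unknown", None)


# Hypothetical database of previous symptoms and their fixes.
known = {
    symptom_string({"pc": 0x1000, "sp": 0x7FFC}): "reseat memory module",
}

print(diagnose(symptom_string({"pc": 0x1000, "sp": 0x7FFC}), known))
print(diagnose(symptom_string({"pc": 0x1004, "sp": 0x7FFC}), known))
print(diagnose(symptom_string({"r0": 0xDEADBEEF}), known))
```

String similarity is a crude stand-in for whatever matching a real symptom database would use; the point is only that a black-box slice of state plus history can be enough to suggest a fix.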
Research Topics
- How can we create diagnostics that are cheap enough to leave in production systems?
- How much does the diagnostic interface of components need to be standardized if we want external control of diagnostic
functions and the ability to correlate events across different modules?
- How well can we leverage the network effect of collecting diagnostics from multiple installations to compute higher-order statistics not possible with only a local view?
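A toy version of the last question, using the disk-batch example from the Issues list: each installation sees at most a couple of failures, but pooling the reports exposes a per-batch failure rate that no single site could compute. The inventory layout, site and batch names, and failure set below are fabricated for illustration only.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Disk = Tuple[str, str, str]   # (site, disk_id, manufacturing batch)


def failure_rate_by_batch(inventory: List[Disk], failed: Set[str]) -> Dict[str, float]:
    """Pooled view: fraction of disks in each batch that failed, across all sites."""
    total: Counter = Counter()
    bad: Counter = Counter()
    for _site, disk_id, batch in inventory:
        total[batch] += 1
        if disk_id in failed:
            bad[batch] += 1
    return {batch: bad[batch] / total[batch] for batch in total}


def max_failures_at_one_site(inventory: List[Disk], failed: Set[str]) -> int:
    """Local view: the most failures any single installation observes."""
    per_site = Counter(site for site, disk_id, _ in inventory if disk_id in failed)
    return max(per_site.values(), default=0)


# Made-up fleet: 5 sites, each with 3 disks from batch A and 1 from batch B.
inventory: List[Disk] = []
for s in range(5):
    for i in range(3):
        inventory.append((f"site{s}", f"s{s}-a{i}", "A"))
    inventory.append((f"site{s}", f"s{s}-b0", "B"))

failed = {"s0-b0", "s1-b0", "s2-b0", "s3-b0", "s0-a1"}

print(max_failures_at_one_site(inventory, failed))   # 2: nothing alarming locally
print(failure_rate_by_batch(inventory, failed))      # batch B: 0.8, batch A: about 0.07
```

In this made-up fleet no site sees more than two failures, yet the pooled data shows four of the five batch-B disks failing.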