Summary of Breakout Session, 9/27/01

Group 1: Undo

Lots of questions...

  • What is Undo good for? Are we trying to recover from software bugs or from human (e.g., admin) errors? Related to this, we might want to keep track of human-generated vs. computer-generated data.
     
  • We need to draw a boundary around the system (or the parts of the system) where we are attempting to undo. Anything outside of this scope, we won't attempt to undo.
     
  • But do we want to attempt compensating actions for things outside of our immediate undo control?
     
  • We should come up with a way to (at least) present information relevant to the undo action to the user (such as dependencies and history information). E.g., undoing this will remove some 500 files that were created; we could then ask the user what they want to do with them (a sketch of this follows the list).
     
  • Where is Undo applicable? Workstations, Internet service clusters?
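
A minimal sketch of the "report, then ask" flow described above, not something proposed in the session; the Operation record and undo_report helper are invented for illustration:

    from dataclasses import dataclass, field

    @dataclass
    class Operation:
        op_id: int
        description: str
        created_files: list[str] = field(default_factory=list)
        depends_on: list[int] = field(default_factory=list)    # ids of earlier operations

    def undo_report(target: Operation, log: list[Operation]) -> str:
        """Summarize what undoing `target` would affect, so the user can decide."""
        dependents = [op for op in log if target.op_id in op.depends_on]
        lines = [f"Undoing: {target.description}"]
        if target.created_files:
            lines.append(f"  will remove {len(target.created_files)} file(s) created by it")
        for op in dependents:
            lines.append(f"  later operation depends on it: {op.description}")
        return "\n".join(lines)

    # The user would then choose to proceed, keep the created files, or cancel.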

Group 2: Virtualization and restartability

Checkpointing

  • Different levels
    • At higher levels it is hard to capture state – no good abstractions
    • But if you have such abstractions, it is easier to roll back to an app-level checkpoint (a sketch follows this list)
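
A minimal sketch of an app-level checkpoint, assuming the application exposes its state through a small, explicit abstraction (the Editor class below is invented for illustration):

    import copy

    class Editor:
        """Toy app whose relevant state is one explicit structure."""
        def __init__(self):
            self.buffer = []                  # the app-level abstraction of its state

        def checkpoint(self):
            # Capture only the app-level state, not process/OS details.
            return copy.deepcopy(self.buffer)

        def rollback(self, snapshot):
            # Rolling back is just restoring the named state.
            self.buffer = copy.deepcopy(snapshot)

    ed = Editor()
    ed.buffer.append("line 1")
    snap = ed.checkpoint()
    ed.buffer.append("bad edit")
    ed.rollback(snap)                         # buffer is back to ["line 1"]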

VMs

  • Easier to verify
  • App must go through approved interfaces
    • Verify coverage of interfaces
    • Would like clean OS (and other) interfaces
  • Treat the OS as a large, poorly debugged library that everyone shares

Modular restart

  • How to deal with persistent state
  • The Yahoo group maintains an interface for persistence
    • Write stateless code and use this interface for persistence (a sketch follows this list)
    • Are there other kinds of app-level APIs that we don't currently have?
  • What do you need for a clean restart?
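
A minimal sketch of the stateless-code-plus-persistence-interface idea, not Yahoo's actual API; PersistentStore and CounterService are invented names:

    from abc import ABC, abstractmethod
    from typing import Optional

    class PersistentStore(ABC):
        """Hypothetical narrow persistence interface; a restart only reconnects to it."""
        @abstractmethod
        def put(self, key: str, value: bytes) -> None: ...
        @abstractmethod
        def get(self, key: str) -> Optional[bytes]: ...

    class CounterService:
        """Stateless app logic: every request reads and writes through the store."""
        def __init__(self, store: PersistentStore):
            self.store = store                # the only thing to re-establish after a restart

        def increment(self, key: str) -> int:
            raw = self.store.get(key)
            count = int(raw) + 1 if raw is not None else 1
            self.store.put(key, str(count).encode())
            return count

Because the service keeps no state of its own, killing and restarting it loses nothing; a clean restart reduces to re-establishing the connection to the persistence interface.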

Group 3: Benchmarking and metrics

Problem: “unavailability” is not well defined

  • No longer binary -- what’s the objective function?
    • 2 min @ 9am != 2 min @ 1am
    • Premier users != basic users
    • A stock-trading transaction != a price quote
  • One definition for web services: probability of success (success per the SLA, typically a response-time bound); see the sketch after this list
    • e.g., the FCC's blocked-call measure: blocked calls / all calls
  • What about data quality (graceful degradation)?
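
A back-of-the-envelope sketch of a non-binary unavailability measure along these lines; the weights and field names are invented for illustration, not proposed values:

    def weighted_unavailability(requests, sla_latency_s=2.0):
        """requests: dicts with 'ok', 'latency_s', 'hour', 'premier', 'kind' fields."""
        lost, total = 0.0, 0.0
        for r in requests:
            w = 1.0
            w *= 3.0 if 9 <= r["hour"] < 17 else 1.0      # 2 min @ 9am != 2 min @ 1am
            w *= 2.0 if r["premier"] else 1.0             # premier users != basic users
            w *= 5.0 if r["kind"] == "trade" else 1.0     # a trade != a price quote
            total += w
            if not r["ok"] or r["latency_s"] > sla_latency_s:
                lost += w                                 # failed or missed the SLA
        return lost / total if total else 0.0             # cf. FCC: blocked calls / all calls

    # One blocked premier trade at 9am outweighs one slow off-peak basic quote.
    print(weighted_unavailability([
        {"ok": False, "latency_s": 0.0, "hour": 9,  "premier": True,  "kind": "trade"},
        {"ok": True,  "latency_s": 3.0, "hour": 1,  "premier": False, "kind": "quote"},
        {"ok": True,  "latency_s": 0.2, "hour": 13, "premier": False, "kind": "quote"},
    ]))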

How to make progress

  • Pick a service
    • Web (e-commerce, search engine), email, .NET, IM
    • Identify common faults (faulty workload) and failures (symptoms, for verification)
    • Define unavailability and data quality (see the sketch after this list)
      • e.g., 90% of the data is 50% as useful
  • Look at existing performance benchmarks
    • are they good availability benchmarks?
    • e.g. SPECWeb99
      • closed-loop -- request rate affected by failures/TCP interactions
      • time resolution on the order of minutes
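
A sketch of data quality as a utility curve, taking the "90% of the data is 50% as useful" example above as one point on an otherwise invented, piecewise-linear curve:

    def utility(fraction_of_data: float) -> float:
        """Usefulness of a degraded response given the fraction of data served."""
        points = [(0.0, 0.0), (0.5, 0.1), (0.9, 0.5), (1.0, 1.0)]   # (data served, usefulness)
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= fraction_of_data <= x1:
                return y0 + (y1 - y0) * (fraction_of_data - x0) / (x1 - x0)
        return 1.0

    # An availability benchmark could score a degraded-but-responding service by
    # utility(fraction served) instead of counting every response as all-or-nothing.
    print(utility(0.9))    # 0.5, per the example above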

Group 4: Diagnostics and on-line testing

Issues

  • Verbosity vs. Abstraction in diagnostic reports
    • How much to report?
    • How to compress/abstract information?
    • How to know what are the important bits?
    • The tension comes from the fact that we may not know, a priori, what we want to look for; as a consequence, everything needs to be observable
  • Should diagnostics ever be turned off?
    • No
    • We must test and ship the same system configuration: if we test with diagnostics, we must ship with diagnostics
    • With diagnostics always on, we can collect data on "normal" operation, so we can use statistical techniques to identify outliers or to classify normal/abnormal system states/behavior (see the sketch after this list)
  • Local vs. Global event correlation
    • There is only so much one can observe/learn from local analysis of a single module or system.
    • Far more can be learned from correlating events across modules or system installations. E.g., we are more likely to discover that disks fail in batches if we correlate failures across many installations than within a single installation.
  • Internal vs. External testing [of components]
    • Related to clear-box vs. black-box testing
    • Internal testing can aid fault containment and effect fail-stop behavior
    • External testing required for end-to-end guarantees
    • External control of diagnostics and self-testing is useful:
      • When we want a module to do expensive testing only when it is not loaded
      • For human operators diagnosing system problems
    • Interesting example of extreme black-box diagnostics: some IBM machines send a "symptom string" back home in the event of a failure. The symptom string is some slice of machine state (e.g., registers). A database of previous symptoms and solutions can aid diagnosis.
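
A minimal sketch of the always-on-diagnostics idea feeding a statistical check; the metric names, baseline data, and z-score threshold are invented for illustration:

    import statistics

    def find_outliers(baseline, current, z_threshold=3.0):
        """Flag metrics whose current value is far outside the 'normal' baseline."""
        outliers = []
        for metric, value in current.items():
            history = baseline.get(metric, [])
            if len(history) < 2:
                continue                                   # not enough "normal" data yet
            mean = statistics.mean(history)
            stdev = statistics.stdev(history) or 1e-9      # avoid division by zero
            if abs(value - mean) / stdev > z_threshold:
                outliers.append(metric)
        return outliers

    # Baseline gathered while diagnostics stay on during normal operation.
    normal = {"disk_errors_per_hour": [0, 1, 0, 2, 1],
              "req_latency_ms": [20, 22, 19, 21, 23]}
    print(find_outliers(normal, {"disk_errors_per_hour": 9, "req_latency_ms": 21}))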

Research Topics

  1. How can we create diagnostics that are cheap enough to leave in production systems?
  2. How much does the diagnostic interface of components need to be standardized if we want to have external control of diagnostic functions and correlate events across different modules?
  3. How well can we leverage the network effect of collecting diagnostics from multiple installations to compute higher-order statistics not possible with only a local view?

Modified: 12 May 2002