Notes from ROC Retreat Feedback Session, 6/12/2002
--------------------------------------------------

1) Bill Tetzlaff, IBM
 - interested in fast restart
 - glad to see data on failures
 - unfortunate that we have to go back to hand-written problem reports;
   shows the ineffectiveness of the data-collection/etc. system
 - proposal for next OceanStore app: email attachments; big & tremendously
   duplicated (seems to be arguing for a single-instance store); provide
   the ability to search them, find the latest version, etc.
 - OceanStore: look at disconnected/weakly-connected operation
 - MTTR: big stateful systems; 20-40 minutes common. This is a big problem
   and very different from a stateless web server/cache. Also the ability
   to come up on different hardware [VMs?] or even on different software,
   as a way to try out new software or revert to older software [Undo can
   be spun toward this?]. Need 1-2 orders of magnitude faster, and probably
   needs interesting new hardware to store state [IBM 3090-like, NVRAM, ...]
 - important challenge is segregating state: you're trying to escape from
   state-gone-bad and you need to be able to discard it

2) Kim Keeton, HP
 - overlap w/HP work in the storage-system space
   - utility functions, multi-constraint optimization for perf. and avail.
   - Chang Lee PhD (CMU) on multi-constraint optimization
 - declarative specifications are important: users/admins specify goals,
   requirements, *intentions* and let the system do the right thing
   - they've done this in the perf space (storage), starting into avail.
   - IWGOS paper on Rome (HP's spec language), upcoming SIGOPS
 - time-travel is important; how to specify it in the interface
 - collaboration through CITRIS on availability stuff
 - OceanStore: so, does performance matter after all?
   - we keep saying it doesn't, but OStore jumped on it because of long latency
 - challenge: pull-the-NIC demo at the next retreat
 - how real will OStore be? backup/archival service for real users?
 - wants to hear more about introspection: what to monitor, how to use the data
 - OceanStore: a lot of the coolness is based on the properties of Tapestry,
   requiring the reader to understand Tapestry relatively deeply. Needs a
   more self-contained explanation
 - could you use OceanStore as the storage layer for Undo?
 - what about interesting file semantics atop OStore, things like Elephant
   where important versions are identified, etc.?
 - bridging the gap between the FT and systems communities
   - another EASY-style workshop before ASPLOS w/a 1-day tutorial on
     traditional FT/dependability. Submission deadline in July.

3) Mark Verber, Tellme
 - a lot of services don't have a way to collect data and verify that it's
   good; think about tools & methodologies that people could use to do this
 - excited about Undo work toward an algebra describing system/operation
   properties, some sort of declarative specification
   - last retreat, panned Undo. Now sees the approach of building a
     framework that people could plug into as a very promising idea.
 - OStore: build something that people could actually use
   - wants to see developers' home directories in OceanStore by next time
   - avoid the danger of building fancy infrastructure and never getting to
     the apps, since the apps really show the edge conditions, unexpected
     constraints, etc.; they're the best source of insight. Suggests 3-4 apps
   - be ruthless about function so that you get something done yet still
     prove some important points
 - statistical stability
   - MTTR is important and oft-neglected, but don't drop the quality bar as
     you focus on MTTR.
   - likes the Yahoo example of not doing memory management / garbage
     collection
     - there are domains where this isn't appropriate/doesn't work
   - use more care in building things; figure out how to help people do this
   - adaptive systems, statistically-stable systems: interesting
     - but if people can't build single-component systems that work
       effectively, what hope is there for far more complex systems?
 - today it's hard/impossible to even build a stable, reliable Ethernet
 - build in instrumentation, be rigorous about collecting it, and do data
   analysis and visualization as you build/deploy it
   - they've found it very valuable @ Tellme
 - breakout on the storage API was valuable
   - too expensive and unscalable to store everything in a nice ACID store
   - usual solutions are to buy WebLogic or build an ad hoc system
   - should formalize what the constraints and tradeoffs are; develop a
     standard vocabulary identifying canonical architectures; look again @
     Bayou, J2EE service classes
   - engineers actually do know the constraints/requirements of what
     they're building, so they could express them to an API

4) James Hamilton, Microsoft
 - ROC is "totally cool", the right area, exactly what industry needs; we
   have an opportunity to make significant improvements/contributions
   because the world today is such a disaster
 - Undo: initially didn't buy it or its importance, but has now come to
   realize how essential it is, especially after seeing the data he's
   collected
 - Stanford/UCB collaboration is a great thing
 - likes the combined DB/OS course @ Berkeley, 262a/b
   - Kim says it may be in danger of being discontinued
     - 262b is getting too specialized; maybe better to have a combined
       262a, then separate versions of 262b on different specializations
   - James says it's better to put people from different disciplines together
 - likes the approach of quantifying properties (QAPSL stuff)
   - thinks some of the stuff in the talk was too black-and-white; even
     banks make continuous tradeoffs and get pretty far down toward
     availability over quality
     - Bill: do they later have a way to discover degraded quality? Yes,
       via audits, etc. And the loss rate is low enough that it's worth it,
       since the cost of getting it better would be much higher than the
       loss penalty
   - everyone wants ACID, but not deadlocks, single points of failure,
     unserved customers, etc.
     So they're willing to sacrifice a lot
   - danger in George's stuff is that there are lots and lots of cliffs in
     the space
 - RAINS: ROC is not a patch for bad systems. Recovery costs a lot of
   resources and you don't want to do it too often. You've got to have it,
   of course, since systems aren't perfect, but RAINS goes too far in
   throwing stuff out too quickly. Maybe an interesting case to study, though.
 - VMs: exception handling doesn't work for fault containment in practice,
   and maybe VMs could act as better fault-containment domains
 - internet service failure data: would like to see it turn into benchmarks
 - theory of Undo work: until you build a system, it's not interesting
   - 1) need real data to motivate that admins are the problem
   - 2) need to have a running system and study the results w/ and w/o Undo
 - OceanStore
   - performance does matter. Don't make a system suck because performance
     is a problem. Worth getting it within a factor of 2-3, but that's
     enough; then focus on real problems.
   - need data on the cost to buy and administer a terabyte, and use that
     to motivate. If you focus too much on performance, you'll lose the
     value of OceanStore, which is to get the admin out of the game
   - skeptical of the security (privacy), deletion, revocation story
     - worried that there's no way to delete data; Kim agrees. It's the way
       the world works
     - [discussion of deletion and feasibility]
     - is throwing the keys away sufficient? Yes, if you trust the security
       story, which James doesn't. Due to Moore's law, if there's an
       important document, you can marshal resources to crack it
       - need crypto that can last 50 years when attacked by 25% of the
         world's total computing resources
 - Geo: legacy systems issue. Is it worthwhile to apply ROC to legacy
   systems?
   - yes. Look at Schwab trying to integrate an old crufty DB system with
     new tech. Legacy is a big source of problems and an area that needs to
     be addressed, and maybe ROC could do this.
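The "throwing the keys away" point above is the crypto-shredding idea: encrypt each object under its own key, keep the keys separate from the (possibly undeletable) ciphertext, and "delete" by destroying the key. A minimal toy sketch of that structure, in Python; the class and method names are illustrative, and the SHA-256-based keystream stands in for a real cipher, so this is not production cryptography:

```python
import hashlib
import secrets


class CryptoShredStore:
    """Toy sketch of deletion by key destruction ("crypto-shredding").

    Ciphertext blobs persist indefinitely (as in an archival store);
    only the per-object keys are deletable.
    """

    def __init__(self):
        self._blobs = {}  # object id -> ciphertext (never deleted)
        self._keys = {}   # object id -> per-object key (deletable)

    def _keystream(self, key, n):
        # Hash-based keystream in counter mode -- illustration only.
        out = bytearray()
        counter = 0
        while len(out) < n:
            out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return bytes(out[:n])

    def put(self, oid, plaintext):
        key = secrets.token_bytes(32)
        self._keys[oid] = key
        ks = self._keystream(key, len(plaintext))
        self._blobs[oid] = bytes(a ^ b for a, b in zip(plaintext, ks))

    def get(self, oid):
        key = self._keys[oid]  # raises KeyError once shredded
        ct = self._blobs[oid]
        ks = self._keystream(key, len(ct))
        return bytes(a ^ b for a, b in zip(ct, ks))

    def shred(self, oid):
        # The ciphertext remains stored, but is now unreadable.
        del self._keys[oid]
```

This makes James's objection concrete: after `shred()`, readability rests entirely on the cipher staying unbroken for the lifetime of the archived ciphertext, which is exactly the 50-year question raised above.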
5) Nisha Talagala, Sun
 - liked the probabilistic consistency work
   - but to trust it, you have to believe (and define well) the failure model
 - data gathering is very valuable
 - operator error: should also do work on diagnosis. Can often recover
   without diagnosis, but problems may well recur over and over
 - focus on MTTR is interesting. Going toward a more reasonable definition
   of availability (removing the time-averaging component). Should continue
   work toward better definitions of availability
   - Kim says also look at performability: not just the level of 9's, but
     the amount of performance degradation vs. normal behavior, and for
     what length of time
   - [discussion about whether customers/users will accept more complex
     definitions that can't be reduced to a single number; consensus is
     yes, if it provides significant value]
   - Mark: collect lots of metrics. Try to classify behavior into:
     - successful
     - delayed (txn succeeded but exceeded the performance/latency window)
     - degenerate (didn't get ideal/perfect function, but something
       happened to allow forward progress. Example: misqueueing a request
       into a less-efficient, but still working, queue)
 - likes the archival part of OceanStore; good application, hard problem,
   OStore matches the needs pretty well

6) Jeff Darcy, EMC
 - excited to see OceanStore working
 - George's stuff: look at utility functions in terms of the market impact
   of decisions. Incorporate ideas of uncertainty and design/implementation
   risk into the tradeoffs, since they're not captured
 - RAINS: look at rolling checkpoints instead of / in addition to rolling
   restarts
 - OceanStore: run the protocols for insert, etc. through verification
   tools (smit??, murphy??)
 - the fact that the cost of signing things is a bottleneck is an important
   result to disseminate
 - apps: people are unlikely to adopt new distributed filesystems; the
   email attachment problem isn't great; a web cache is probably the best
   application now to motivate it
 - need more support for evolvability if you really want 50+ year
   durability, especially with data formats, protocol versions, etc.
 - need to think about Digital Rights Mgmt; won't get away w/ignoring it
 - the idea of a time-travel data store w/better granularity than
   snapshots/checkpoints is very interesting & important. Conceptual
   models, APIs, efficiency of implementation.

7) Mendel Rosenblum, VMware & Stanford
 - analysis of failures will be a real contribution
 - VM approaches are interesting. Systems have 100M lines of code; not
   surprising that there are bugs, and unlikely that you could rewrite them
   to be bug-free even if you took 10 years w/o a change in functionality.
   ROC is a great approach, but an extremely hard problem.
 - no related-work slides == bad. Even a list, just to show you've read it.
 - OceanStore: he worked on a project that put a lot of effort into taking
   over the world (Sprite). It didn't. Don't invest too much time in taking
   over the world, especially if it overwhelms the research.
   - but you should still "eat your own dog food"

8) Blue Lang, Veritas
 - get outside the box and stay there
   - research is what's important. Don't worry about shipping products.
   - won't be able to build infinitely-provable systems, so don't bother
 - ROC is a patch for unreliable systems. We won't be able to stay ahead of
   the bug curve for the foreseeable future, especially as systems continue
   to scale
 - excited about the API discussion; would like to move forward and
   collaborate w/Veritas
 - it's not OK to sacrifice performance. But it is OK to sacrifice real
   perf. in exchange for perceived performance
 - but even unoptimized new algorithms are interesting, and should be
   published.
   They pay people to optimize, but the ideas are what's really important
   to get from the research community
 - almost never see a full root-cause analysis in large data centers/sites;
   operators/customers don't want to deploy the manpower necessary to do
   the diagnosis (even at an employee-centric place like IGS). Maybe 1/1000
   problems ever got a real RCA in IBM Global Services.
   - advantage of ROC is that you may never need to get to that point
 - don't necessarily have to implement solutions, but design implementable
   ones, publish, and they [industry] will implement them
 - re: simplicity: scale really changes things
 - OceanStore & web caches: it's really easy to configure web caches today;
   it's not clear why OceanStore is needed to simplify things. Talk to real
   administrators to sanity-check these things.
 - intent-based logging/APIs/etc. (Undo stuff) is a massive idea that could
   really change things.

9) Lisa Spainhower, IBM, via Armando
 - re: FIG, fault injection, etc.: keep in mind the difference between
   fault injection (bit flips, etc.) and *error* injection (things that
   become apparent at the level of the app or API)
   - error injection is the more important; more representative of real
     observed failures
 - re: benchmarking: to keep the goodwill of industry collaborators, don't
   design benchmarks whose purpose is to embarrass people into doing the
   right thing. For example, TPC is so expensive that you're not going to
   run it unless you look good. Do benchmarks that are small, that people
   can run individually, etc. Don't create the kind of benchmark that's
   difficult to run, costly to put together, and will primarily embarrass
   people (no one will report their results)
   - Mendel: there's a hazard to small benchmarks.