9/20 - Germán Goldszmidt, Manager, Advanced Cluster Technologies, IBM Research
The Océano project - Intelligent Infrastructure for eUtilities. Océano is a prototype of a highly available, scalable, and manageable infrastructure for utilities. It enables multiple customer workloads to be hosted on a collection of sequentially shared resources. These resources are dynamically provisioned from "farms" of servers interconnected by switched LANs. The hosting environment is divided into secure domains, each of which supports a single customer or workload. These domains are dynamic: the resources assigned to them may be augmented when load increases and reduced when load dips. This dynamic resource allocation enables flexible Infrastructure Service Level Agreements (I-SLAs) between the hosted customers and the service providers. The Océano prototype integrates multiple subsystems in three layers: Infrastructure, Control, and Policy. The Infrastructure Layer implements topology discovery, heartbeating, network management, and dynamic configuration. The Control Layer implements server management, priming, network configuration, resource monitoring, and request throttling. The Policy Layer supports I-SLA contract definitions, monitoring and enforcement, problem determination, and predictive/reactive allocations. We will present the overall architecture of Océano and discuss some of its future research challenges.
9/20 - Lisa Spainhower, IBM Research
9/27 - Mendel Rosenblum, Stanford University and VMware
Mendel Rosenblum is an Associate Professor of Computer Science at Stanford University, where he leads the operating systems research group of the FLASH project. Together with his students, he developed the Hive operating system, the SimOS machine simulator, and the Disco virtual machine monitor. He holds Ph.D. and M.S. degrees in computer science from the University of California at Berkeley and a B.A. in mathematics from the University of Virginia.
11/1 - Brendan Murphy, Microsoft Research
Measuring System Behaviour in the Field. Research into the reliability and availability of computer systems and software is often based on data captured during lab experiments, which differs significantly from what end users experience in the real world. Occasionally a research project will attempt to use data from the field, but problems often arise during analysis of this data. Computer and software manufacturers have developed techniques for, and have been analysing, data captured from customer sites for many years. But because they are reluctant to publish their data, their analysis techniques also go unpublished.
This talk describes the techniques (and problems) involved in measuring system reliability, availability, and recoverability in the field. The talk focuses on characterizing the behaviour of individual systems, providing some examples of the results of this analysis on systems running VMS, UNIX, and NT. The talk also discusses attempts that have been made to characterize the behaviour of complex applications and configurations, highlighting opportunities for future research.
11/8 - Jim Gray, Microsoft Research
Internet Reliability. The talk begins with generic comments on availability and reliability. It then looks at "classic" OLTP systems that deliver 4 or 5 9s of availability. The talk then turns to "modern" Internet systems that struggle to get 2 9s. To be specific, I will use Microsoft and eBay (along with the Netcraft data). It will turn out that hackers, operations, and hosters each cost you about 1% of availability with current technology. So, we have a challenge.
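As a rough guide to what those "9s" mean in practice, the downtime implied by a given number of nines can be computed directly (an illustrative sketch, assuming a 365-day year; these figures are arithmetic, not data from the talk):

```python
# Downtime per year implied by "N nines" of availability:
# 2 nines = 99% available, 4 nines = 99.99%, and so on.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(nines: int) -> float:
    """Minutes of downtime per year at an availability of `nines` nines."""
    availability = 1 - 10 ** (-nines)
    return (1 - availability) * MINUTES_PER_YEAR

for n in (2, 3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):8.1f} min/year")
```

So a "2 nines" Internet site is down roughly 5,256 minutes (about 3.6 days) a year, while a "5 nines" OLTP system is down only about 5 minutes a year, which is why each 1% lost to hackers, operations, or hosting matters so much.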