The SCADS Director: Scaling a Distributed Storage System Under Stringent Performance Requirements

The elasticity of cloud computing environments provides an economic incentive for automatic resource allocation for stateful systems running in the cloud. However, these systems have to meet strict performance Service-Level Objectives (SLOs) expressed using upper percentiles of request latency, such as the 99th percentile. Such latency measurements are very noisy, which complicates the design of dynamic resource allocation. We design and evaluate the SCADS Director, a control framework that reconfigures the storage system on the fly in response to workload changes, using a performance model of the system. We demonstrate that such a framework can respond to both unexpected data hotspots and diurnal workload patterns without violating strict performance SLOs.
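To make the point about noisy upper-percentile measurements concrete, the following is a minimal sketch, not taken from the SCADS Director itself. It assumes synthetic, lognormally distributed latencies and a hypothetical 100 ms threshold; the function names (`nearest_rank_percentile`, `slo_met`, `window_statistics`) are illustrative only. It checks a percentile-based SLO and compares how much the 99th percentile and the median vary from one measurement window to the next.

```python
import math
import random
import statistics


def nearest_rank_percentile(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def slo_met(latencies_ms, p=99, threshold_ms=100.0):
    """True if the p-th percentile latency is within the (hypothetical) threshold."""
    return nearest_rank_percentile(latencies_ms, p) <= threshold_ms


def relative_spread(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def window_statistics(num_windows=50, window_size=1000, seed=0):
    """Compute the median and 99th percentile for each synthetic window.

    Latencies are drawn from a lognormal distribution, which is an
    assumption made for illustration, not a measured workload.
    """
    rng = random.Random(seed)
    medians, p99s = [], []
    for _ in range(num_windows):
        window = [rng.lognormvariate(0.0, 1.0) for _ in range(window_size)]
        medians.append(nearest_rank_percentile(window, 50))
        p99s.append(nearest_rank_percentile(window, 99))
    return medians, p99s


if __name__ == "__main__":
    medians, p99s = window_statistics()
    print("relative spread of median:", relative_spread(medians))
    print("relative spread of 99th percentile:", relative_spread(p99s))
```

Under these assumptions, the 99th percentile of each window is determined by only a handful of the slowest requests, so it fluctuates noticeably more between windows than the median does, even though the underlying distribution never changes. A controller that reacts directly to such a signal risks reacting to noise rather than to a real workload shift.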
