Automating Datacenter Operations Using Machine Learning

Today's Internet datacenters run many complex, large-scale Web applications that are difficult to manage. The main challenges are understanding user workloads and application performance, and quickly identifying and resolving performance problems. Statistical Machine Learning (SML) provides a methodology for quickly processing the large quantities of monitoring data that these applications generate, finding repeating patterns in their behavior, and building accurate models of their performance. This dissertation argues that SML is a useful tool for simplifying and automating datacenter operations, and it demonstrates the application of SML to three important problems in this area: characterization and synthesis of workload spikes, dynamic resource allocation in stateful systems, and quick and accurate identification of recurring performance problems.
