Automatic exploration of datacenter performance regimes

Horizontally scalable Internet services present an opportunity to use automatic resource allocation strategies for system management in the datacenter. In most previous work, a controller uses a performance model of the system to decide on the optimal allocation of resources. However, these models are usually trained offline or on a small-scale deployment, and consequently do not accurately capture the performance of the controlled application in production. To control a Web application accurately, the model must be trained directly on the production system and adapted to changes in the workload and performance of the application. In this paper we propose to train the performance model using an exploration policy that quickly collects data from different performance regimes of the application. Our approach to managing the exploration process balances two competing goals: avoiding violations of the performance SLAs, and collecting enough data to train an accurate performance model, which requires pushing the system close to its capacity. We show that with our exploration policy we can train a performance model of a Web 2.0 application in less than an hour and then immediately use the model in a resource allocation controller.
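The tension between SLA safety and data collection can be illustrated with a small simulation. The sketch below is not the paper's policy or model; it rests on several assumptions, all hypothetical: each server behaves like an M/M/1 queue with service rate `MU`, the SLA is a mean-latency bound `SLA`, the exploration policy (`explore`) removes one server per step to raise per-server load and adds one back once latency exceeds a `safety` fraction of the SLA, and the performance model (`fit_linear`) is a least-squares line for inverse latency versus per-server load. The controller (`servers_needed`) then picks the smallest server count whose predicted latency meets the SLA.

```python
import math
import random

# Hypothetical ground truth (an assumption, not the paper's system): each
# server behaves like an M/M/1 queue, so mean latency = 1 / (MU - lam),
# where lam is the per-server request rate.
MU = 100.0   # assumed per-server service rate, requests/s
SLA = 0.05   # assumed mean-latency SLA, seconds


def observe_latency(total_rate, servers, rng):
    """Simulated latency measurement with +/-5% multiplicative noise."""
    lam = total_rate / servers
    if lam >= MU:
        return math.inf  # overloaded: the queue grows without bound
    return rng.uniform(0.95, 1.05) / (MU - lam)


def explore(total_rate, max_servers, safety=0.7, steps=40, seed=0):
    """Hypothetical exploration policy: remove one server per step to push
    per-server load toward capacity; add one back as soon as the measured
    latency exceeds safety * SLA. Records (per-server load, latency) pairs."""
    rng = random.Random(seed)
    servers = max_servers
    samples = []
    for _ in range(steps):
        lat = observe_latency(total_rate, servers, rng)
        samples.append((total_rate / servers, lat))
        if lat > safety * SLA:
            servers = min(servers + 1, max_servers)  # back off toward safety
        elif servers > 1:
            servers -= 1  # probe a heavier performance regime
    return samples


def fit_linear(samples):
    """Ordinary least squares fit of 1/latency = a + b * load."""
    xs = [load for load, _ in samples]
    ys = [1.0 / lat for _, lat in samples]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def servers_needed(total_rate, a, b, sla=SLA, max_servers=100):
    """Smallest server count whose predicted latency 1 / (a + b * load)
    meets the SLA; None if no count up to max_servers does."""
    for s in range(1, max_servers + 1):
        inv_latency = a + b * (total_rate / s)
        if inv_latency > 0 and 1.0 / inv_latency <= sla:
            return s
    return None


samples = explore(total_rate=800.0, max_servers=20)
a, b = fit_linear(samples)
print(servers_needed(1000.0, a, b))
```

Under these assumed parameters the walk descends from 20 servers and then hovers around 11 to 12, so most samples come from the heavily loaded regime near capacity while latency stays mostly below the SLA. Because each decision reacts to the previous measurement, the policy can overshoot by one step. The `safety` factor governs this trade-off: raising it explores closer to capacity but lets that overshoot land nearer to, or beyond, the SLA. The fitted model then suggests 13 servers for a new load of 1,000 requests/s.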
