Best arm identification in multi-armed bandits with delayed feedback

We propose a generalization of the best arm identification problem in stochastic multi-armed bandits (MAB) to the setting where every pull of an arm is associated with delayed feedback. The delay in feedback increases the effective sample complexity of standard algorithms, but this cost can be offset if we have access to partial feedback received before a pull is completed. We propose a general framework to model the relationship between partial and delayed feedback, and as a special case we introduce efficient algorithms for settings where the partial feedback is a biased or unbiased estimator of the delayed feedback. Additionally, we propose a novel extension of the algorithms to the parallel MAB setting, where an agent can control a batch of arms. Our experiments in real-world settings, involving policy search and hyperparameter optimization in computational sustainability domains for fast charging of batteries and wildlife corridor construction, demonstrate that exploiting the structure of partial feedback can lead to significant improvements over baselines in both sequential and parallel MAB.
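As a rough illustration of the unbiased-partial-feedback special case described above, the sketch below runs a standard successive-elimination loop, but folds in noisy partial observations that arrive during the delay window of each pull instead of waiting for the final reward. All names, the Gaussian noise model, and the specific confidence bound are illustrative assumptions, not the paper's algorithm.

```python
import math
import random


def best_arm_partial_feedback(means, delay=5, noise=0.5,
                              confidence=0.05, max_rounds=2000, seed=0):
    """Successive-elimination sketch (illustrative, not the paper's method).

    Each pull's full reward would only be revealed after `delay` steps;
    here, every step of the delay window yields an unbiased noisy preview
    of that reward, which is folded into the empirical mean immediately.
    """
    rng = random.Random(seed)
    k = len(means)
    active = list(range(k))
    counts = [0] * k
    sums = [0.0] * k

    for _ in range(max_rounds):
        # Pull every surviving arm once.
        for a in active:
            final = means[a] + rng.gauss(0, noise)  # delayed (full) feedback
            # Partial feedback: unbiased estimators of the delayed feedback,
            # available before the pull completes.
            for _ in range(delay):
                sums[a] += final + rng.gauss(0, noise)
                counts[a] += 1

        # Anytime-style confidence radius (assumed form, union-bounded over arms).
        def radius(a):
            n = counts[a]
            return math.sqrt(math.log(4 * k * n * n / confidence) / (2 * n))

        mu = {a: sums[a] / counts[a] for a in active}
        best = max(active, key=lambda a: mu[a])
        # Eliminate arms whose upper bound falls below the leader's lower bound.
        active = [a for a in active
                  if mu[a] + radius(a) >= mu[best] - radius(best)]
        if len(active) == 1:
            return active[0]

    # Budget exhausted: return the empirically best surviving arm.
    return max(active, key=lambda a: sums[a] / counts[a])
```

Because the partial observations are unbiased, the empirical means concentrate `delay` times faster per pull than if the algorithm waited for the completed reward, which is the intuition behind offsetting the delay cost.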