Adaptive Bandits: Towards the best history-dependent strategy

We consider multi-armed bandit games with possibly adaptive opponents. We introduce models Theta of constraints based on equivalence classes on the common history (the information shared by the player and the opponent), which define two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes defined by some model theta* in Theta. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in Theta. This allows us to model opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits, and other situations. When Theta = {theta}, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by ~O(sqrt(TAC)), where A is the number of actions and C the number of classes of theta. When many models are available, all known algorithms achieving a regret of O(sqrt(T)) are unfortunately intractable and scale poorly with the number of models |Theta|. Our contribution is to provide tractable algorithms with regret bounded by T^{2/3} C^{1/3} log(|Theta|)^{1/2}.
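To illustrate the single-model case Theta = {theta}, here is a minimal Python sketch (not the paper's algorithm) that runs one independent EXP3 instance (Auer et al., 2002) per equivalence class of the common history. The intuition behind the ~O(sqrt(TAC)) flavor of the bound: each class c receives T_c rounds with sum_c T_c = T, each per-class EXP3 incurs regret on the order of sqrt(T_c A log A), and summing over the C classes via Cauchy-Schwarz gives sqrt(TAC) up to log factors. The names ClassBandit and Exp3, the toy opponent, and the per-class choice of gamma are illustrative assumptions, not taken from the paper.

```python
import math
import random

class Exp3:
    """EXP3 (Auer et al., 2002) for n_actions arms, rewards in [0, 1]."""
    def __init__(self, n_actions, gamma):
        self.n_actions = n_actions
        self.gamma = gamma
        self.weights = [1.0] * n_actions
        self.probs = [1.0 / n_actions] * n_actions

    def select(self):
        total = sum(self.weights)
        # Mix the exponential weights with uniform exploration gamma.
        self.probs = [(1 - self.gamma) * w / total + self.gamma / self.n_actions
                      for w in self.weights]
        return random.choices(range(self.n_actions), weights=self.probs)[0]

    def update(self, action, reward):
        # Importance-weighted reward estimate: unbiased since E[est] = reward.
        est = reward / self.probs[action]
        self.weights[action] *= math.exp(self.gamma * est / self.n_actions)
        # Renormalize to avoid floating-point overflow over long runs.
        m = max(self.weights)
        self.weights = [w / m for w in self.weights]

class ClassBandit:
    """One EXP3 instance per equivalence class induced by a single model theta."""
    def __init__(self, n_classes, n_actions, gamma):
        self.bandits = [Exp3(n_actions, gamma) for _ in range(n_classes)]

    def act(self, cls):
        return self.bandits[cls].select()

    def observe(self, cls, action, reward):
        self.bandits[cls].update(action, reward)

# Toy run: the opponent's rewards depend only on the player's previous
# action, i.e. the class of the history is the last action played
# (C = A classes; class 0 before the first round).
A, T = 3, 10_000
# Heuristic gamma tuned for the per-class horizon T / A.
gamma = min(1.0, math.sqrt(A * math.log(A) / ((math.e - 1) * T / A)))
player, cls = ClassBandit(n_classes=A, n_actions=A, gamma=gamma), 0
for t in range(T):
    a = player.act(cls)
    reward = 1.0 if a == (cls + 1) % A else 0.0  # best reply cycles through arms
    player.observe(cls, a, reward)
    cls = a  # next round's class = action just played
```

In this toy run the best history-class-based strategy (play (cls + 1) mod A in class cls) earns reward 1 every round, and the per-class EXP3 instances converge to it, which is exactly the comparator used in scenario (1).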
