Rotting Bandits

The Multi-Armed Bandits (MAB) framework highlights the trade-off between acquiring new knowledge (Exploration) and leveraging available knowledge (Exploitation). In the classical MAB problem, a decision maker chooses an arm at each time step and receives a reward for that choice; her objective is to maximize the cumulative expected reward over the time horizon. The MAB problem has been studied extensively, particularly under the assumption that the arms' reward distributions are stationary, or quasi-stationary, over time. We consider a variant of the MAB framework, which we term Rotting Bandits, where each arm's expected reward decays as a function of the number of times it has been pulled. We are motivated by many real-world scenarios such as online advertising, content recommendation, crowdsourcing, and more. We present algorithms, accompanied by simulations, and derive theoretical guarantees.
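
To make the setting concrete, below is a minimal Python sketch of a rotting-bandit environment in which each arm's expected reward decays with the number of times that particular arm has been pulled. The specific decay shape mu_i(n) = mu_i(0) / (1 + n)^rho_i, the class name RottingBanditEnv, and all parameter values are illustrative assumptions for this sketch, not the model or algorithms studied in the paper.

import numpy as np

class RottingBanditEnv:
    # Minimal rotting-bandit simulator: each arm's expected reward decays
    # with the number of times that particular arm has been pulled.
    def __init__(self, initial_means, decay_rates, noise_std=0.1, seed=0):
        self.initial_means = np.asarray(initial_means, dtype=float)
        self.decay_rates = np.asarray(decay_rates, dtype=float)
        self.noise_std = noise_std
        self.pulls = np.zeros(len(self.initial_means), dtype=int)  # per-arm pull counts
        self.rng = np.random.default_rng(seed)

    def expected_reward(self, arm):
        # Assumed decay shape: mu_i(n) = mu_i(0) / (1 + n)^rho_i,
        # where n is the number of prior pulls of arm i.
        n = self.pulls[arm]
        return self.initial_means[arm] / (1.0 + n) ** self.decay_rates[arm]

    def pull(self, arm):
        # Observed reward = current (decayed) expected reward + Gaussian noise.
        reward = self.expected_reward(arm) + self.rng.normal(0.0, self.noise_std)
        self.pulls[arm] += 1
        return reward

# Usage example: a naive greedy policy that always pulls the arm with the highest
# current (true) expected reward; a learning algorithm would estimate it instead.
env = RottingBanditEnv(initial_means=[1.0, 0.8], decay_rates=[0.5, 0.1])
total = sum(env.pull(int(np.argmax([env.expected_reward(a) for a in range(2)])))
            for _ in range(100))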
