Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions

We study the stochastic Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm. We consider two settings: the reward-dependent delay setting, where realized delays may depend on the stochastic rewards, and the reward-independent delay setting, where they may not. Our main contribution is algorithms that achieve near-optimal regret in each setting, with an additional additive dependence on the quantiles of the delay distribution. Our results make no assumptions on the delay distributions: in particular, we do not assume they come from any parametric family, and we allow for unbounded support and unbounded expectation; we further allow for infinite delays, in which case the algorithm may occasionally receive no feedback at all for a round.
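To make the setting concrete, below is a minimal simulation sketch. All specifics are illustrative assumptions, not the paper's construction: Bernoulli rewards, a reward-independent Pareto delay distribution with infinite mean plus a point mass at infinity (feedback that never arrives), and a plain UCB1 learner that updates only on feedback that has already arrived. The paper's algorithms, which adapt to delay quantiles, are not reproduced here.

```python
# Illustrative sketch: stochastic MAB with unrestricted feedback delays.
# Assumptions (not from the paper): Bernoulli arms, Pareto(alpha) delays with
# alpha <= 1 (infinite mean), a point mass at infinity (lost feedback), and a
# plain UCB1 learner run on observed feedback only.
import math
import random

def sample_delay(rng, p_inf=0.05, alpha=0.9):
    """Infinite delay with probability p_inf; otherwise heavy-tailed Pareto."""
    if rng.random() < p_inf:
        return math.inf
    return int(rng.paretovariate(alpha))  # alpha <= 1: unbounded expectation

def run(horizon=10_000, means=(0.5, 0.45, 0.3), seed=0):
    rng = random.Random(seed)
    k = len(means)
    obs_count = [0] * k    # number of *observed* rewards per arm
    obs_sum = [0.0] * k    # sum of observed rewards per arm
    pending = []           # in-flight feedback: (arrival_round, arm, reward)
    best = max(means)
    regret = 0.0

    for t in range(1, horizon + 1):
        # Deliver feedback whose delay has elapsed.
        still_pending = []
        for arrival, arm, r in pending:
            if arrival <= t:
                obs_count[arm] += 1
                obs_sum[arm] += r
            else:
                still_pending.append((arrival, arm, r))
        pending = still_pending

        # UCB1 index from observed feedback only; arms with no
        # observations yet are treated optimistically.
        def index(a):
            if obs_count[a] == 0:
                return math.inf
            mean = obs_sum[a] / obs_count[a]
            return mean + math.sqrt(2 * math.log(t) / obs_count[a])

        arm = max(range(k), key=index)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        regret += best - means[arm]

        d = sample_delay(rng)
        if d != math.inf:  # infinite delay: this feedback is never observed
            pending.append((t + d, arm, reward))

    return regret

if __name__ == "__main__":
    print(f"pseudo-regret over 10k rounds: {run():.1f}")
```

Because the delays above are drawn independently of the realized rewards, this instantiates the reward-independent setting; in the reward-dependent setting, the sampled delay could instead be a function of the realized reward (e.g., losses returning slower than wins).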
