Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions

We study the stochastic Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm. We consider two settings: the reward-dependent delay setting, where realized delays may depend on the stochastic rewards, and the reward-independent delay setting, where they may not. Our main contribution is algorithms that achieve near-optimal regret in each setting, with an additional additive dependence on the quantiles of the delay distribution. Our results make no assumptions on the delay distributions: in particular, we do not assume they come from any parametric family, and we allow for unbounded support and unbounded expectation; we further allow for infinite delays, in which case the algorithm may occasionally receive no feedback at all for a round.
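To make the setting concrete, below is a minimal simulation sketch. All specifics are illustrative assumptions, not the paper's construction: Bernoulli rewards, a reward-independent Pareto delay distribution with infinite mean plus a point mass at infinity (feedback that never arrives), and a plain UCB1 learner that updates only on feedback that has already arrived. The paper's algorithms, which adapt to delay quantiles, are not reproduced here.

```python
# Illustrative sketch: stochastic MAB with unrestricted feedback delays.
# Assumptions (not from the paper): Bernoulli arms, Pareto(alpha) delays with
# alpha <= 1 (infinite mean), a point mass at infinity (lost feedback), and a
# plain UCB1 learner run on observed feedback only.
import math
import random

def sample_delay(rng, p_inf=0.05, alpha=0.9):
    """Infinite delay with probability p_inf; otherwise heavy-tailed Pareto."""
    if rng.random() < p_inf:
        return math.inf
    return int(rng.paretovariate(alpha))  # alpha <= 1: unbounded expectation

def run(horizon=10_000, means=(0.5, 0.45, 0.3), seed=0):
    rng = random.Random(seed)
    k = len(means)
    obs_count = [0] * k    # number of *observed* rewards per arm
    obs_sum = [0.0] * k    # sum of observed rewards per arm
    pending = []           # in-flight feedback: (arrival_round, arm, reward)
    best = max(means)
    regret = 0.0

    for t in range(1, horizon + 1):
        # Deliver feedback whose delay has elapsed.
        still_pending = []
        for arrival, arm, r in pending:
            if arrival <= t:
                obs_count[arm] += 1
                obs_sum[arm] += r
            else:
                still_pending.append((arrival, arm, r))
        pending = still_pending

        # UCB1 index from observed feedback only; arms with no
        # observations yet are treated optimistically.
        def index(a):
            if obs_count[a] == 0:
                return math.inf
            mean = obs_sum[a] / obs_count[a]
            return mean + math.sqrt(2 * math.log(t) / obs_count[a])

        arm = max(range(k), key=index)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        regret += best - means[arm]

        d = sample_delay(rng)
        if d != math.inf:  # infinite delay: this feedback is never observed
            pending.append((t + d, arm, reward))

    return regret

if __name__ == "__main__":
    print(f"pseudo-regret over 10k rounds: {run():.1f}")
```

Because the delays above are drawn independently of the realized rewards, this instantiates the reward-independent setting; in the reward-dependent setting, the sampled delay could instead be a function of the realized reward (e.g., losses returning slower than wins).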
