Paper information - Bounded Regret for Finite-Armed Structured Bandits

Bounded Regret for Finite-Armed Structured Bandits

We study a new type of K-armed bandit problem where the expected return of one arm may depend on the returns of other arms. We present a new algorithm for this general class of problems and show that under certain circumstances it is possible to achieve finite expected cumulative regret. We also give problem-dependent lower bounds on the cumulative regret showing that at least in special cases the new algorithm is nearly optimal.

Tor Lattimore | Rémi Munos

[1] Umar Syed,et al. Bandits, Query Learning, and the Haystack Dimension , 2011, COLT.

[2] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules , 1985 .

[3] Sébastien Bubeck,et al. Prior-free and prior-dependent regret bounds for Thompson Sampling , 2013, 2014 48th Annual Conference on Information Sciences and Systems (CISS).

[4] Shipra Agrawal,et al. Analysis of Thompson Sampling for the Multi-armed Bandit Problem , 2011, COLT.

[5] Shipra Agrawal,et al. Further Optimal Regret Bounds for Thompson Sampling , 2012, AISTATS.

[6] T. L. Graves,et al. Asymptotically Efficient Adaptive Choice of Control Laws in Controlled Markov Chains , 1997 .

[7] Rémi Munos,et al. Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis , 2012, ALT.

[8] Peter Auer,et al. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem , 2010, Period. Math. Hung..

[9] Sébastien Bubeck,et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems , 2012, Found. Trends Mach. Learn..

[10] Rémi Munos,et al. Thompson Sampling for 1-Dimensional Exponential Family Bandits , 2013, NIPS.

[11] Csaba Szepesvári,et al. Variance estimates and exploration function in multi-armed bandit , 2008 .

[12] Vianney Perchet,et al. Bounded regret in stochastic multi-armed bandits , 2013, COLT.

[13] Peter Auer,et al. Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[14] John N. Tsitsiklis,et al. A Structured Multiarmed Bandit Problem and the Greedy Policy , 2008, IEEE Transactions on Automatic Control.

[15] Benjamin Van Roy,et al. Eluder Dimension and the Sample Complexity of Optimistic Exploration , 2013, NIPS.

[16] Alessandro Lazaric,et al. Sequential Transfer in Multi-armed Bandit with Finite Set of Models , 2013, NIPS.

[17] R. Agrawal,et al. Asymptotically efficient adaptive allocation schemes for controlled Markov chains: finite parameter space , 1989 .

[18] H. Robbins,et al. Optimal sequential sampling from two populations. , 1984, Proceedings of the National Academy of Sciences of the United States of America.

[19] Csaba Szepesvári,et al. Online Optimization in X-Armed Bandits , 2008, NIPS.