OSOM: A Simultaneously Optimal Algorithm for Multi-Armed and Linear Contextual Bandits

We consider the stochastic linear (multi-armed) contextual bandit problem with the possibility of hidden simple multi-armed bandit structure in which the rewards are independent of the contextual information. Algorithms that are designed solely for one of the regimes are known to be sub-optimal for their alternate regime. We design a single computationally efficient algorithm that simultaneously obtains problem-dependent optimal regret rates in the simple multi-armed bandit regime and minimax optimal regret rates in the linear contextual bandit regime, without knowing a priori which of the two models generates the rewards. These results are proved under the condition of stochasticity of contextual information over multiple rounds. Our results should be viewed as a step towards principled data-dependent policy class selection for contextual bandits.
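To make the "simple multi-armed bandit regime" concrete, the following is a minimal sketch of the standard UCB1 index policy (in the style of Auer et al. [21]) on Bernoulli arms whose rewards do not depend on any context. It is not the OSOM algorithm from the paper; the function name `ucb1`, the arm means, the horizon, and the seed are all illustrative assumptions.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms (hypothetical means); return pulls per arm."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of pulls of each arm
    sums = [0.0] * k   # cumulative reward of each arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            # Empirical mean plus an exploration bonus that shrinks with pulls.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        # Reward depends only on the chosen arm, never on contextual features.
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


counts = ucb1(means=[0.2, 0.5, 0.8], horizon=5000)
print(counts)
```

With well-separated means such as these, most pulls should go to the last arm. A policy of this kind ignores context entirely, which is why, as the abstract notes, it is sub-optimal when the rewards do in fact depend linearly on the context.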

Peter L. Bartlett | Niladri S. Chatterji | Vidya Muthukumar

[1] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules , 1985 .

[2] Karthik Sridharan,et al. BISTRO: An Efficient Relaxation-Based Method for Contextual Bandits , 2016, ICML.

[3] Vianney Perchet,et al. Anytime optimal algorithms in stochastic multi-armed bandits , 2016, ICML.

[4] John Langford,et al. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits , 2014, ICML.

[5] Haipeng Luo,et al. Improved Regret Bounds for Oracle-Based Adversarial Contextual Bandits , 2016, NIPS.

[6] J. Tropp. FREEDMAN'S INEQUALITY FOR MATRIX MARTINGALES , 2011, 1101.3039.

[7] Ambuj Tewari,et al. From Ads to Interventions: Contextual Bandits in Mobile Health , 2017, Mobile Health - Sensors, Analytic Methods, and Applications.

[8] Aleksandrs Slivkins,et al. Contextual Bandits with Similarity Information , 2009, COLT.

[9] L. Ralaivola,et al. Empirical Bernstein Inequality for Martingales : Application to Online Learning , 2013 .

[10] John Langford,et al. Contextual Bandit Algorithms with Supervised Learning Guarantees , 2010, AISTATS.

[11] R. Oliveira. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges , 2009, 0911.0600.

[12] John Langford,et al. Contextual Bandits with Continuous Actions: Smoothing, Zooming, and Adapting , 2019, COLT.

[13] Haipeng Luo,et al. Model selection for contextual bandits , 2019, NeurIPS.

[14] Aleksandrs Slivkins,et al. The Best of Both Worlds: Stochastic and Adversarial Bandits , 2012, COLT.

[15] M. Woodroofe. A One-Armed Bandit Problem with a Concomitant Variable , 1979 .

[16] Zhiwei Steven Wu,et al. The Externalities of Exploration and How Data Diversity Helps Exploitation , 2018, COLT.

[17] Csaba Szepesvári,et al. Improved Algorithms for Linear Stochastic Bandits , 2011, NIPS.

[18] Khashayar Khosravi,et al. Mostly Exploration-Free Algorithms for Contextual Bandits , 2017, Manag. Sci..

[19] Sampath Kannan,et al. A Smoothed Analysis of the Greedy Algorithm for the Linear Contextual Bandit Problem , 2018, NeurIPS.

[20] Martin J. Wainwright,et al. High-Dimensional Statistics , 2019 .

[21] Peter Auer,et al. Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[22] Akshay Krishnamurthy,et al. Contextual bandits with surrogate losses: Margin bounds and efficient algorithms , 2018, NeurIPS.

[23] Wei Chu,et al. A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.

[24] Matthew J. Streeter,et al. Tighter Bounds for Multi-Armed Bandits with Expert Advice , 2009, COLT.

[25] John Langford,et al. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information , 2007, NIPS.

[26] Wei Chu,et al. Contextual Bandits with Linear Payoff Functions , 2011, AISTATS.

[27] Akshay Krishnamurthy,et al. Efficient Algorithms for Adversarial Contextual Learning , 2016, ICML.

[28] Éva Tardos,et al. Learning in Games: Robustness of Fast Convergence , 2016, NIPS.

[29] Haipeng Luo,et al. Corralling a Band of Bandit Algorithms , 2016, COLT.

[30] Roman Vershynin,et al. High-Dimensional Probability , 2018 .

[31] Jean-Yves Audibert,et al. Regret Bounds and Minimax Policies under Partial Monitoring , 2010, J. Mach. Learn. Res..

[32] John Langford,et al. Making Contextual Decisions with Low Technical Debt , 2016 .