Confidence-Budget Matching for Sequential Budgeted Learning

A core element of decision-making under uncertainty is feedback on the quality of the performed actions. In many applications, however, such feedback is restricted. For example, in recommendation systems, repeatedly asking the user to rate the quality of recommendations will quickly annoy them. In this work, we formalize decision-making problems with a querying budget, i.e., a (possibly time-dependent) hard limit on the number of reward queries allowed. Specifically, we consider multi-armed bandits, linear bandits, and reinforcement learning problems. We start by analyzing the performance of ‘greedy’ algorithms that query a reward whenever they can. We show that in fully stochastic settings, doing so performs surprisingly well, but in the presence of any adversity, it might lead to linear regret. To overcome this issue, we propose the Confidence-Budget Matching (CBM) principle, which queries rewards when the confidence intervals are wider than the inverse square root of the available budget. We analyze the performance of CBM-based algorithms in different settings and show that they perform well in the presence of adversity in the contexts, initial states, and budgets.
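As a rough illustration of the CBM querying rule described above, the sketch below layers it on top of a standard UCB index for a stochastic multi-armed bandit. This is a minimal sketch under my own assumptions (the function name `cbm_ucb`, the Hoeffding-style confidence radius, and the constants are illustrative), not the paper's algorithm.

```python
import numpy as np

def cbm_ucb(reward_fn, n_arms, horizon, budget, delta=0.05, seed=None):
    """Illustrative UCB bandit with a Confidence-Budget Matching query rule:
    a reward is queried only when the pulled arm's confidence interval is
    wider than the inverse square root of the remaining budget."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms)   # number of *observed* rewards per arm
    sums = np.zeros(n_arms)     # sum of observed rewards per arm
    remaining = budget

    for _ in range(horizon):
        # Hoeffding-style confidence radius; infinite for never-observed arms.
        with np.errstate(divide="ignore"):
            radius = np.sqrt(2.0 * np.log(1.0 / delta) / counts)
        means = np.divide(sums, counts, out=np.zeros(n_arms), where=counts > 0)
        ucb = np.where(counts > 0, means + radius, np.inf)

        # Pull the arm with the highest upper confidence bound (ties at random).
        arm = int(rng.choice(np.flatnonzero(ucb == ucb.max())))
        reward = reward_fn(arm)  # the environment produces a reward either way

        # CBM rule: spend a query only if the confidence interval is wider
        # than 1 / sqrt(remaining budget).
        if remaining > 0 and radius[arm] > 1.0 / np.sqrt(remaining):
            counts[arm] += 1
            sums[arm] += reward
            remaining -= 1

    return np.divide(sums, counts, out=np.zeros(n_arms), where=counts > 0)
```

With, say, Bernoulli rewards such as `reward_fn = lambda a: np.random.binomial(1, p[a])`, the sketch spends queries early, when intervals are wide relative to the large remaining budget, and naturally stops querying once the intervals shrink below the budget-matched threshold.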
