On the Theory of Reinforcement Learning with Once-per-Episode Feedback

We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once, at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the standard requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner’s complete trajectory was “good” or “bad,” but harder to provide a reward signal at each intermediate step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sub-linear regret.
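To make the feedback model concrete, the following is a minimal sketch of how once-per-episode binary feedback might be generated. The abstract only states that trajectory labels come from an unknown parametric model; the logistic link, the feature map phi, and the parameter theta_star below are illustrative assumptions, not the paper's stated construction.

```python
import numpy as np

# Hedged sketch: one plausible instantiation of once-per-episode binary feedback.
# The logistic form, feature map, and theta_star are assumptions for illustration.

rng = np.random.default_rng(0)

H = 10  # episode horizon (assumed)
d = 5   # dimension of the trajectory feature map (assumed)
theta_star = rng.normal(size=d)  # unknown parameter of the label model (assumed)

def phi(trajectory):
    """Illustrative trajectory feature map: sum of per-step feature vectors."""
    return np.sum(trajectory, axis=0)

def run_episode(policy):
    """Roll out one episode; the learner observes only a single binary label at the end."""
    # Placeholder dynamics: each step produces a d-dimensional feature vector.
    trajectory = np.stack([policy(t) for t in range(H)])
    logit = theta_star @ phi(trajectory)
    p_good = 1.0 / (1.0 + np.exp(-logit))  # P(label = 1 | trajectory)
    label = rng.binomial(1, p_good)        # the only feedback the learner receives
    return trajectory, label

# Usage: a random policy that emits a random step feature at every time step.
random_policy = lambda t: rng.normal(size=d)
traj, y = run_episode(random_policy)
print("episode label:", y)
```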
