Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies

State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing \emph{full-planning} on Markov Decision Processes (MDPs) built from the gathered experience. In this paper, we focus on model-based RL in the finite-state, finite-horizon MDP setting and establish that exploring with \emph{greedy policies}, i.e., acting via \emph{1-step planning}, can achieve tight minimax performance in terms of regret, $\tilde{\mathcal{O}}(\sqrt{HSAT})$. Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and doing so decreases the computational complexity by a factor of $S$. The results are based on a novel analysis of real-time dynamic programming, which we then extend to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to ones that act via 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.
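To make the contrast concrete, the following is a minimal sketch, not the paper's exact algorithm, of a model-based learner that explores with greedy policies: it maintains an empirical model and optimistic value upper bounds, performs a single optimistic backup at each visited state-action pair, and acts greedily on the result instead of re-solving the empirical MDP by full backward induction each episode. The class name, the environment interface, and the Hoeffding-style bonus are illustrative assumptions (the paper's analysis relies on tighter confidence sets).

```python
# Sketch of greedy (1-step planning) model-based RL for a tabular
# finite-horizon MDP. Illustrative only; constants and bonus form are assumed.
import numpy as np


class GreedyModelBasedLearner:
    def __init__(self, S, A, H, delta=0.05):
        self.S, self.A, self.H = S, A, H
        self.delta = delta
        self.counts = np.zeros((S, A), dtype=np.int64)           # N(s, a)
        self.trans_counts = np.zeros((S, A, S), dtype=np.int64)  # N(s, a, s')
        self.reward_sums = np.zeros((S, A))                      # accumulated rewards
        # Optimistic value upper bounds, initialized to the trivial bound H - h.
        self.Q = np.full((H, S, A), float(H))
        self.V = np.zeros((H + 1, S))
        self.V[:H] = np.arange(H, 0, -1)[:, None]                # V_h(s) <= H - h

    def _bonus(self, n):
        # Hoeffding-style exploration bonus (assumed; a Bernstein-type
        # bonus would give tighter, variance-aware confidence sets).
        return self.H * np.sqrt(
            np.log(2 * self.S * self.A * self.H / self.delta) / max(n, 1))

    def act(self, h, s):
        # Greedy action with respect to the optimistic Q: this is the
        # entire "planning" step, no value iteration over the model.
        return int(np.argmax(self.Q[h, s]))

    def update(self, h, s, a, r, s_next):
        # Record the transition in the empirical model.
        self.counts[s, a] += 1
        self.trans_counts[s, a, s_next] += 1
        self.reward_sums[s, a] += r
        # Single optimistic backup at the visited (h, s, a); no full
        # backward induction over all states and stages.
        n = self.counts[s, a]
        p_hat = self.trans_counts[s, a] / n
        r_hat = self.reward_sums[s, a] / n
        target = r_hat + p_hat @ self.V[h + 1] + self._bonus(n)
        self.Q[h, s, a] = min(self.Q[h, s, a], target, self.H - h)
        self.V[h, s] = self.Q[h, s].max()
```

In this sketch, each environment step costs one $\mathcal{O}(S)$ backup plus an $\mathcal{O}(A)$ argmax, whereas a full-planning variant would recompute $Q$ for every $(h, s, a)$ on the empirical model before each episode; this is the kind of per-update saving in $S$ that the paper's analysis shows comes at no cost in the regret rate.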
