Exploration Driven by an Optimistic Bellman Equation

Exploring high-dimensional state spaces and finding sparse rewards are central problems in reinforcement learning. Exploration strategies are frequently either naïve (e.g., ε-greedy or Boltzmann policies), intractable (i.e., a full Bayesian treatment of reinforcement learning), or rely heavily on heuristics. The lack of a tractable but principled exploration approach unnecessarily complicates the application of reinforcement learning to a broader range of problems. Efficient exploration can be achieved by relying on the uncertainty of the state-action value function. To obtain this uncertainty, we maintain an ensemble of value function estimates and present an optimistic Bellman equation (OBE) for such ensembles. The OBE is derived from a relative entropy maximization principle and yields an implicit exploration bonus that improves exploration during action selection. This implied exploration bonus can be seen as a well-principled form of intrinsic motivation and exhibits favorable theoretical properties. OBE can be applied to a wide range of algorithms. As an application of the principle, we propose two algorithms, Optimistic Q-learning and Optimistic DQN, which outperform comparison methods on standard benchmarks.
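
To make the ensemble-based optimism concrete, the sketch below shows one way an ensemble of tabular Q-estimates can be turned into an optimistic action-selection rule. The abstract does not spell out the OBE itself, so the bonus form (ensemble mean plus a scaled standard deviation), the coefficient beta, and the per-member update rule are illustrative assumptions rather than the paper's exact equation.

```python
# Minimal sketch, NOT the paper's OBE: optimism from ensemble disagreement.
# The bonus (mean + beta * std) and the per-member TD update are assumptions.
import numpy as np

class OptimisticEnsembleQ:
    def __init__(self, n_states, n_actions, n_members=5,
                 alpha=0.1, gamma=0.99, beta=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Random initialization keeps the ensemble members diverse.
        self.q = rng.normal(0.0, 0.1, size=(n_members, n_states, n_actions))
        self.alpha, self.gamma, self.beta = alpha, gamma, beta

    def optimistic_values(self, s):
        # Ensemble mean plus a disagreement bonus acts as an implicit
        # exploration bonus (optimism in the face of uncertainty).
        mean = self.q[:, s, :].mean(axis=0)
        std = self.q[:, s, :].std(axis=0)
        return mean + self.beta * std

    def act(self, s):
        # Greedy action with respect to the optimistic value estimate.
        return int(np.argmax(self.optimistic_values(s)))

    def update(self, s, a, r, s_next, done):
        # Each member is updated toward an optimistic bootstrap target.
        target_next = 0.0 if done else np.max(self.optimistic_values(s_next))
        for k in range(self.q.shape[0]):
            td = r + self.gamma * target_next - self.q[k, s, a]
            self.q[k, s, a] += self.alpha * td
```

In this sketch the exploration bonus decays naturally: as the ensemble members agree on a state-action value, the standard deviation shrinks and action selection reduces to ordinary greedy Q-learning.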
