Exploration Driven by an Optimistic Bellman Equation

Exploring high-dimensional state spaces and finding sparse rewards are central problems in reinforcement learning. Exploration strategies are frequently either naïve (e.g., ε-greedy or Boltzmann policies), intractable (i.e., a full Bayesian treatment of reinforcement learning), or rely heavily on heuristics. The lack of a tractable but principled exploration approach unnecessarily complicates the application of reinforcement learning to a broader range of problems. Efficient exploration can be achieved by relying on the uncertainty of the state-action value function. To obtain this uncertainty, we maintain an ensemble of value function estimates and present an optimistic Bellman equation (OBE) for such ensembles. The OBE is derived from a relative entropy maximization principle and yields an implicit exploration bonus that improves exploration during action selection. This implied exploration bonus can be seen as a well-principled form of intrinsic motivation and exhibits favorable theoretical properties. OBE can be applied to a wide range of algorithms. As an application of the principle, we propose two algorithms, Optimistic Q-learning and Optimistic DQN, which outperform comparison methods on standard benchmarks.
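
To make the ensemble-based optimism concrete, the sketch below shows one way an ensemble of tabular Q-estimates can be turned into an optimistic action-selection rule. The abstract does not spell out the OBE itself, so the bonus form (ensemble mean plus a scaled standard deviation), the coefficient beta, and the per-member update rule are illustrative assumptions rather than the paper's exact equation.

```python
# Minimal sketch, NOT the paper's OBE: optimism from ensemble disagreement.
# The bonus (mean + beta * std) and the per-member TD update are assumptions.
import numpy as np

class OptimisticEnsembleQ:
    def __init__(self, n_states, n_actions, n_members=5,
                 alpha=0.1, gamma=0.99, beta=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Random initialization keeps the ensemble members diverse.
        self.q = rng.normal(0.0, 0.1, size=(n_members, n_states, n_actions))
        self.alpha, self.gamma, self.beta = alpha, gamma, beta

    def optimistic_values(self, s):
        # Ensemble mean plus a disagreement bonus acts as an implicit
        # exploration bonus (optimism in the face of uncertainty).
        mean = self.q[:, s, :].mean(axis=0)
        std = self.q[:, s, :].std(axis=0)
        return mean + self.beta * std

    def act(self, s):
        # Greedy action with respect to the optimistic value estimate.
        return int(np.argmax(self.optimistic_values(s)))

    def update(self, s, a, r, s_next, done):
        # Each member is updated toward an optimistic bootstrap target.
        target_next = 0.0 if done else np.max(self.optimistic_values(s_next))
        for k in range(self.q.shape[0]):
            td = r + self.gamma * target_next - self.q[k, s, a]
            self.q[k, s, a] += self.alpha * td
```

In this sketch the exploration bonus decays naturally: as the ensemble members agree on a state-action value, the standard deviation shrinks and action selection reduces to ordinary greedy Q-learning.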
