Unifying Count-Based Exploration and Intrinsic Motivation

We consider an agent's uncertainty about its environment and the problem of generalizing this uncertainty across observations. Specifically, we focus on the problem of exploration in non-tabular reinforcement learning. Drawing inspiration from the intrinsic motivation literature, we use density models to measure uncertainty, and propose a novel algorithm for deriving a pseudo-count from an arbitrary density model. This technique enables us to generalize count-based exploration algorithms to the non-tabular case. We apply our ideas to Atari 2600 games, providing sensible pseudo-counts from raw pixels. We transform these pseudo-counts into intrinsic rewards and obtain significantly improved exploration in a number of hard games, including the infamously difficult Montezuma's Revenge.
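The abstract does not spell out how a pseudo-count is obtained, so the following is only a minimal sketch under stated assumptions. It assumes the pseudo-count is computed from the model's probability of an observation before an update (rho) and after it (rho_prime), as rho * (1 - rho_prime) / (rho_prime - rho), and that the intrinsic reward decays as the inverse square root of that count. The class `EmpiricalDensity`, the function names, and the constants `beta` and `eps` are hypothetical choices made for illustration. The toy density model is a plain frequency table, chosen because the pseudo-count should then match the true visit count.

```python
import math
from collections import Counter


class EmpiricalDensity:
    """Toy density model (illustrative assumption): probability = visit frequency."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def prob(self, x):
        # Probability assigned to x given everything observed so far.
        return self.counts[x] / self.total if self.total else 0.0

    def update(self, x):
        # Observe x once.
        self.counts[x] += 1
        self.total += 1


def pseudo_count(model, x):
    """Derive a pseudo-count for x, updating the model with x as a side effect.

    Assumed form: rho * (1 - rho_prime) / (rho_prime - rho), where rho is the
    probability of x before the update and rho_prime the probability after it.
    """
    rho = model.prob(x)
    model.update(x)
    rho_prime = model.prob(x)
    gain = rho_prime - rho
    if gain <= 0:
        # The model learned nothing new about x; treat x as fully explored.
        return math.inf
    return rho * (1.0 - rho_prime) / gain


def intrinsic_reward(n_hat, beta=0.05, eps=0.01):
    """Exploration bonus that shrinks as the pseudo-count grows (beta, eps are hypothetical)."""
    return beta / math.sqrt(n_hat + eps)


if __name__ == "__main__":
    model = EmpiricalDensity()
    for obs in ["a", "a", "b"]:
        model.update(obs)
    for obs in ["a", "c"]:
        n_hat = pseudo_count(model, obs)
        print(obs, n_hat, intrinsic_reward(n_hat))
```

With a frequency-table model the pseudo-count reduces to the ordinary visit count, which is the sense in which such a quantity generalizes count-based exploration. A model that shares statistics across similar observations, such as one over raw pixels, would instead assign non-zero pseudo-counts to observations it has never seen exactly.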
