An information-theoretic approach to curiosity-driven reinforcement learning

We provide a fresh look at the problem of exploration in reinforcement learning, drawing on ideas from information theory. First, we show that Boltzmann-style exploration, one of the main exploration methods used in reinforcement learning, is optimal from an information-theoretic point of view, in that it optimally trades expected return for the coding cost of the policy. Second, we address the problem of curiosity-driven learning. We propose that, in addition to maximizing the expected return, a learner should choose a policy that also maximizes the learner’s predictive power. This makes the world both interesting and exploitable. Optimal policies then have the form of Boltzmann-style exploration with a bonus, containing a novel exploration–exploitation trade-off which emerges naturally from the proposed optimization principle. Importantly, this exploration–exploitation trade-off persists in the optimal deterministic policy, i.e., when there is no exploration due to randomness. As a result, exploration is understood as an emerging behavior that optimizes information gain, rather than being modeled as pure randomization of action choices.
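To make the resulting policy form concrete, below is a minimal numerical sketch of Boltzmann-style exploration with an additive bonus: actions are drawn with probability proportional to pi_0(a|s) * exp(beta * (Q(s,a) + alpha * bonus(s,a))), where the inverse temperature beta trades expected return against the coding cost of the policy relative to a prior pi_0. The count-based bonus used here, and the names boltzmann_policy, curiosity_bonus, alpha, and beta, are illustrative assumptions for this sketch, not the paper's exact expression for predictive power.

import numpy as np

def boltzmann_policy(q_values, beta=1.0, prior=None):
    # Softmax (Boltzmann) policy over actions for a single state:
    # pi(a) proportional to prior(a) * exp(beta * Q(a)).
    # beta trades expected return against the KL cost from the prior.
    if prior is None:
        prior = np.ones_like(q_values) / len(q_values)
    logits = beta * q_values + np.log(prior)
    logits -= logits.max()                # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def curiosity_bonus(counts):
    # Illustrative stand-in for an information-gain term: a simple
    # count-based bonus that decays as an action is tried more often.
    # This is an assumption made for the sketch, not the paper's formula.
    return 1.0 / np.sqrt(counts + 1.0)

def curious_boltzmann_policy(q_values, counts, beta=1.0, alpha=1.0):
    # Boltzmann policy with a bonus: exploration and exploitation are
    # traded off inside a single softmax over (Q + alpha * bonus).
    return boltzmann_policy(q_values + alpha * curiosity_bonus(counts), beta)

if __name__ == "__main__":
    q = np.array([1.0, 0.9, 0.2])         # estimated returns per action
    n = np.array([50, 2, 10])             # visit counts per action
    print(curious_boltzmann_policy(q, n, beta=5.0, alpha=0.5))

Setting alpha = 0 recovers plain Boltzmann exploration; with alpha > 0 the bonus shifts probability mass toward rarely tried actions, and this shift persists even as beta grows and the policy becomes nearly deterministic.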
