An information-theoretic approach to curiosity-driven reinforcement learning

We provide a fresh look at the problem of exploration in reinforcement learning, drawing on ideas from information theory. First, we show that Boltzmann-style exploration, one of the main exploration methods used in reinforcement learning, is optimal from an information-theoretic point of view, in that it optimally trades expected return for the coding cost of the policy. Second, we address the problem of curiosity-driven learning. We propose that, in addition to maximizing the expected return, a learner should choose a policy that also maximizes the learner’s predictive power. This makes the world both interesting and exploitable. Optimal policies then have the form of Boltzmann-style exploration with a bonus, containing a novel exploration–exploitation trade-off which emerges naturally from the proposed optimization principle. Importantly, this exploration–exploitation trade-off persists in the optimal deterministic policy, i.e., when there is no exploration due to randomness. As a result, exploration is understood as an emerging behavior that optimizes information gain, rather than being modeled as pure randomization of action choices.
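To make the resulting policy form concrete, below is a minimal numerical sketch of Boltzmann-style exploration with an additive bonus: actions are drawn with probability proportional to pi_0(a|s) * exp(beta * (Q(s,a) + alpha * bonus(s,a))), where the inverse temperature beta trades expected return against the coding cost of the policy relative to a prior pi_0. The count-based bonus used here, and the names boltzmann_policy, curiosity_bonus, alpha, and beta, are illustrative assumptions for this sketch, not the paper's exact expression for predictive power.

import numpy as np

def boltzmann_policy(q_values, beta=1.0, prior=None):
    # Softmax (Boltzmann) policy over actions for a single state:
    # pi(a) proportional to prior(a) * exp(beta * Q(a)).
    # beta trades expected return against the KL cost from the prior.
    if prior is None:
        prior = np.ones_like(q_values) / len(q_values)
    logits = beta * q_values + np.log(prior)
    logits -= logits.max()                # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def curiosity_bonus(counts):
    # Illustrative stand-in for an information-gain term: a simple
    # count-based bonus that decays as an action is tried more often.
    # This is an assumption made for the sketch, not the paper's formula.
    return 1.0 / np.sqrt(counts + 1.0)

def curious_boltzmann_policy(q_values, counts, beta=1.0, alpha=1.0):
    # Boltzmann policy with a bonus: exploration and exploitation are
    # traded off inside a single softmax over (Q + alpha * bonus).
    return boltzmann_policy(q_values + alpha * curiosity_bonus(counts), beta)

if __name__ == "__main__":
    q = np.array([1.0, 0.9, 0.2])         # estimated returns per action
    n = np.array([50, 2, 10])             # visit counts per action
    print(curious_boltzmann_policy(q, n, beta=5.0, alpha=0.5))

Setting alpha = 0 recovers plain Boltzmann exploration; with alpha > 0 the bonus shifts probability mass toward rarely tried actions, and this shift persists even as beta grows and the policy becomes nearly deterministic.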
