Self-Paced Contextual Reinforcement Learning

Generalization and adaptation of learned skills to novel situations are core requirements for intelligent autonomous robots. Although contextual reinforcement learning provides a principled framework for learning and generalizing behaviors across related tasks, it typically relies on uninformed sampling of environments from an unknown, uncontrolled context distribution and thus misses the benefits of structured, sequential learning. We introduce a novel relative-entropy reinforcement learning algorithm that gives the agent the freedom to control the intermediate task distribution, allowing it to progress gradually toward the target context distribution. Empirical evaluation shows that the proposed curriculum learning scheme drastically improves sample efficiency and enables learning in scenarios with both broad and sharp target context distributions, in which classical approaches perform sub-optimally.
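The core idea, reward-driven adaptation of an intermediate context distribution with a relative-entropy penalty that is tightened toward the target distribution over time, can be illustrated with a minimal sketch. The sketch below is not the paper's actual algorithm: it assumes a scalar Gaussian context distribution, a score-function gradient estimator for the expected reward, and a hand-picked linear schedule for the penalty weight alpha; all names, hyperparameters, and the toy reward are illustrative.

```python
import numpy as np

def self_paced_context_update(reward_fn, mean, std, target_mean, target_std,
                              alpha, n_samples=256, lr=0.05):
    """One ascent step on  E_{c~p}[R(c)] - alpha * KL(p || p_target)
    for a scalar Gaussian context distribution p = N(mean, std^2)."""
    c = np.random.normal(mean, std, n_samples)
    r = reward_fn(c)
    adv = r - r.mean()  # baseline subtraction for variance reduction
    # score-function (REINFORCE) gradient of the expected reward
    g_mean = np.mean(adv * (c - mean) / std**2)
    g_std = np.mean(adv * ((c - mean)**2 - std**2) / std**3)
    # analytic gradient of KL(N(mean, std^2) || N(target_mean, target_std^2))
    gkl_mean = (mean - target_mean) / target_std**2
    gkl_std = std / target_std**2 - 1.0 / std
    mean = mean + lr * (g_mean - alpha * gkl_mean)
    std = max(1e-3, std + lr * (g_std - alpha * gkl_std))
    return mean, std

# Toy curriculum: the task is easy near c = 0, but the target contexts
# (the tasks we ultimately care about) are concentrated near c = 3.
reward_fn = lambda c: np.exp(-0.5 * c**2)
mean, std = 0.0, 1.0
for it in range(200):
    alpha = 0.01 * it  # penalty schedule: 0 (free exploration) -> ~2 (target)
    mean, std = self_paced_context_update(reward_fn, mean, std,
                                          target_mean=3.0, target_std=0.5,
                                          alpha=alpha)
print(f"final context distribution: N({mean:.2f}, {std:.2f}^2)")
```

Early on (small alpha) the sampler concentrates on easy, high-reward contexts near c = 0; as alpha grows, the KL penalty dominates and pulls the intermediate distribution toward the harder target tasks around c = 3, realizing the gradual progression described above.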
