Compositional Transfer in Hierarchical Reinforcement Learning

The successful application of general reinforcement learning algorithms to real-world robotics is often limited by their high data requirements. We introduce Regularized Hierarchical Policy Optimization (RHPO) to improve data efficiency in domains with multiple dominant tasks and ultimately reduce the required platform time. To this end, we employ compositional inductive biases on multiple levels, together with corresponding mechanisms for sharing off-policy transition data across low-level controllers and tasks, as well as for scheduling tasks. The presented algorithm enables stable and fast learning in complex, real-world domains in both the parallel multitask and the sequential transfer setting. We show that the investigated types of hierarchy enable positive transfer while partially mitigating negative interference, and we evaluate the benefits of additional incentives for efficient, compositional task solutions in single-task domains. Finally, we demonstrate substantial gains in data efficiency and final performance over competitive baselines in a week-long, physical robot stacking experiment.
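To make the compositional inductive bias concrete, the following is a minimal sketch of a hierarchical mixture policy in which a task-conditioned high-level controller gates over low-level Gaussian components that are shared across all tasks. This is an illustrative assumption-laden sketch, not the paper's implementation: the class name, network sizes, and the use of PyTorch distributions are all choices made here for exposition.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal, Independent, MixtureSameFamily


class HierarchicalMixturePolicy(nn.Module):
    """Illustrative sketch (not the paper's code): a high-level, task-conditioned
    categorical selects among low-level Gaussian components shared across tasks."""

    def __init__(self, obs_dim, act_dim, num_tasks, num_components, hidden=256):
        super().__init__()
        self.num_components = num_components
        self.act_dim = act_dim
        # High-level controller: per-task gating over the shared components.
        self.gate = nn.Sequential(
            nn.Linear(obs_dim + num_tasks, hidden), nn.ReLU(),
            nn.Linear(hidden, num_components),
        )
        # Low-level controllers: task-agnostic Gaussian components shared by all tasks.
        self.components = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * num_components * act_dim),
        )

    def forward(self, obs, task_onehot):
        # Gating logits depend on both observation and task identity.
        logits = self.gate(torch.cat([obs, task_onehot], dim=-1))
        # Component means and log-stds depend on the observation only.
        params = self.components(obs).view(-1, self.num_components, 2 * self.act_dim)
        mean, log_std = params.chunk(2, dim=-1)
        comp = Independent(Normal(mean, log_std.clamp(-5.0, 2.0).exp()), 1)
        return MixtureSameFamily(Categorical(logits=logits), comp)


# Usage example with hypothetical dimensions: sampling actions and log-probabilities
# for a batch of observations, each paired with a one-hot task label.
policy = HierarchicalMixturePolicy(obs_dim=10, act_dim=4, num_tasks=3, num_components=5)
obs = torch.randn(8, 10)
task = torch.nn.functional.one_hot(torch.randint(0, 3, (8,)), 3).float()
dist = policy(obs, task)
action = dist.sample()           # shape (8, 4)
log_prob = dist.log_prob(action)  # shape (8,)
```

Because the low-level components are conditioned only on the observation, transition data collected under any task can be replayed to update them off-policy, while the task-conditioned gate captures task-specific behavior; this mirrors the data-sharing idea described in the abstract, under the stated assumptions.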
