Data-efficient Hindsight Off-policy Option Learning

Solutions to most complex tasks can be decomposed into simpler, intermediate skills that are reusable across a wider range of problems. We follow this concept and introduce Hindsight Off-policy Options (HO2), a new algorithm for efficient and robust option learning. The algorithm relies on critic-weighted maximum likelihood estimation and an efficient dynamic programming inference procedure over off-policy trajectories. The inference procedure can be backpropagated through, both across time steps and through the policy components at every step, making it possible to train all components' parameters off-policy, independently of the data-generating behavior policy. Experimentally, we demonstrate that HO2 outperforms competitive baselines and solves demanding robot stacking and ball-in-cup tasks from raw pixel inputs in simulation. We further compare autoregressive option policies with simple mixture policies, providing insights into the relative impact of two types of abstraction common in the options framework: action abstraction and temporal abstraction. Finally, we illustrate challenges caused by stale data in off-policy option learning and provide effective solutions.
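To make the abstract's "dynamic programming inference procedure over off-policy trajectories" concrete, below is a minimal sketch of the kind of HMM-style forward recursion such an inference could use: given per-step option probabilities from the high-level policy, option termination probabilities, and the likelihood of the executed action under each option, it computes filtered option posteriors along a trajectory. This is an illustrative assumption, not the authors' implementation; all names (option_filter, pi_o, beta, pi_a) are hypothetical.

```python
import numpy as np

def option_filter(pi_o, beta, pi_a):
    """Forward filtering over options along one off-policy trajectory.

    pi_o : [T, K] probability of the high-level policy selecting each of K options.
    beta : [T, K] termination probability of each option at each step.
    pi_a : [T, K] likelihood of the executed action under each option's low-level policy.
    Returns [T, K] filtered posteriors p(o_t | actions and states up to t).
    """
    T, K = pi_a.shape
    post = np.zeros((T, K))
    # First step: an option is always freshly sampled from the high-level policy,
    # then reweighted by how well it explains the first executed action.
    alpha = pi_o[0] * pi_a[0]
    post[0] = alpha / alpha.sum()
    for t in range(1, T):
        # The previous option persists with probability (1 - beta), or it
        # terminates and a new option is drawn from the high-level policy.
        terminate_mass = (post[t - 1] * beta[t]).sum()
        prior = post[t - 1] * (1.0 - beta[t]) + terminate_mass * pi_o[t]
        # Reweight by the likelihood of the observed action and normalize.
        alpha = prior * pi_a[t]
        post[t] = alpha / alpha.sum()
    return post
```

Because the recursion is built from differentiable products and sums, an automatic differentiation framework can backpropagate through the whole inference, which is what allows the option posteriors to serve as weights in a critic-weighted maximum likelihood update of all policy components.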
