SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration

The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. Similarly, distillation of expert behavior can lead to poor results when the experts are sub-optimal. We compare several common approaches for skill transfer on multiple domains, including changes in task and in system dynamics. We identify how existing methods can fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence existing temporally-extended skills for exploration, but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection, without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks, and we use a broad set of ablations to highlight the importance of different components of our method.
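
To make the exploration/learning split concrete, the following is a minimal sketch, not the authors' implementation: a high-level scheduler sequences pretrained skills for temporally-extended data collection, while a separate off-policy learner is trained on the raw low-level transitions. All names (Scheduler, collect_episode, train, and the env, skill, buffer, and learner objects with the interfaces indicated in the comments) are hypothetical placeholders introduced for illustration.

```python
import random


class Scheduler:
    """Picks which pretrained skill to execute next and for how many steps."""

    def __init__(self, skills, max_hold=20):
        self.skills = skills
        self.max_hold = max_hold

    def select(self, state):
        # A learned high-level policy would go here; uniform sampling is a
        # stand-in to keep the sketch self-contained.
        skill = random.choice(self.skills)
        hold = random.randint(1, self.max_hold)
        return skill, hold


def collect_episode(env, scheduler, buffer, episode_len=1000):
    """Exploration: sequence temporally-extended skills, but log raw transitions."""
    state = env.reset()
    steps = 0
    while steps < episode_len:
        skill, hold = scheduler.select(state)
        for _ in range(hold):
            action = skill.act(state)                # skill emits low-level actions
            next_state, reward, done = env.step(action)
            buffer.add(state, action, reward, next_state, done)  # raw experience
            state, steps = next_state, steps + 1
            if done or steps >= episode_len:
                return


def train(env, skills, learner, buffer, num_episodes=100):
    scheduler = Scheduler(skills)
    for _ in range(num_episodes):
        collect_episode(env, scheduler, buffer)
        # The final policy is trained off-policy on the raw transitions, so it
        # is not constrained to the skills that generated the exploration data.
        learner.update(buffer.sample(batch_size=256))
```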
