Hierarchical Relative Entropy Policy Search

Many reinforcement learning (RL) tasks, especially in robotics, consist of multiple strongly structured sub-tasks. Such task structure can be exploited by hierarchical policies that combine a gating network with a set of sub-policies. However, this concept has only been partially explored in real-world settings, and complete methods derived from first principles are still needed. Real-world settings are challenging because their large, continuous state-action spaces are prohibitive for exhaustive sampling methods. We define the problem of learning sub-policies in continuous state-action spaces as finding a hierarchical policy composed of a high-level gating policy that selects which low-level sub-policy the agent executes. To share experience efficiently among all sub-policies, a setting also called inter-policy learning, we treat the sub-policies as latent variables, which allows the update information to be distributed among them. We present three variants of our algorithm, designed to suit a wide variety of real-world robot learning tasks, and evaluate them in two real-robot learning scenarios as well as in several simulations and comparisons.
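As an illustrative sketch of the hierarchical policy structure described above (the symbols o for the sub-policy index, s for the state, and a for the action are our own notation, not taken from the abstract), the decomposition into a gating policy and sub-policies can be written as

    \pi(a \mid s) \;=\; \sum_{o} \pi(o \mid s)\, \pi(a \mid s, o),

where \pi(o \mid s) is the high-level gating policy and each \pi(a \mid s, o) is a low-level sub-policy. Treating o as a latent variable means that every observed state-action sample can contribute to the update of all sub-policies, weighted by the posterior responsibility

    p(o \mid s, a) \;\propto\; \pi(o \mid s)\, \pi(a \mid s, o),

in the spirit of expectation-maximization; the exact update rules of the proposed algorithm are not reproduced here.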
