Guided Policy Search

Direct policy search can effectively scale to high-dimensional systems, but complex policies with hundreds of parameters often present a challenge for such methods, requiring numerous samples and frequently falling into poor local optima. We present a guided policy search algorithm that uses trajectory optimization to direct policy learning and avoid poor local optima. We show how differential dynamic programming can be used to generate suitable guiding samples, and we describe a regularized importance-sampled policy optimization that incorporates these samples into the policy search. We evaluate the method by learning neural network controllers for planar swimming, hopping, and walking, as well as simulated 3D humanoid running.
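To make the regularized importance-sampled policy optimization mentioned above concrete, the sketch below shows how guiding samples (e.g., drawn from a DDP-induced controller) might be re-weighted under the current policy, combined with a log-normalizer regularizer, and optimized by gradient ascent. This is a minimal illustration under strong assumptions, not the paper's implementation: the linear-Gaussian policy, the toy random data, and names such as `guided_objective` and `policy_logprob` are ours.

```python
import numpy as np

# Sketch of a regularized importance-sampled surrogate objective, assuming a
# toy linear-Gaussian policy a ~ N(theta @ s, sigma^2 I).  Guiding samples and
# their behavior log-probabilities would come from a trajectory optimizer; here
# they are random placeholders.

def policy_logprob(theta, states, actions, sigma=0.1):
    """Log-probability (up to a constant) of each sampled trajectory under the policy."""
    mean = states @ theta.T                       # (N, T, da)
    diff = actions - mean
    return -0.5 * np.sum(diff ** 2, axis=(1, 2)) / sigma ** 2

def guided_objective(theta, states, actions, behavior_logprob, returns, w_reg=1e-2):
    """Normalized importance-sampled return estimate plus a log-normalizer regularizer."""
    log_w = policy_logprob(theta, states, actions) - behavior_logprob  # log importance weights
    log_w -= log_w.max()                          # stabilize before exponentiating
    w = np.exp(log_w)
    Z = w.sum()
    # The regularizer discourages the policy from assigning vanishing
    # probability to every sample, which would make the estimate degenerate.
    return (w / Z) @ returns + w_reg * np.log(Z)

def numerical_grad(f, theta, eps=1e-5):
    """Central-difference gradient, adequate for this small illustrative problem."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        e = np.zeros_like(theta)
        e[idx] = eps
        g[idx] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

# Placeholder "guiding samples": N trajectories of length T.
rng = np.random.default_rng(0)
N, T, ds, da = 20, 10, 4, 2
states = rng.normal(size=(N, T, ds))
actions = rng.normal(size=(N, T, da))
behavior_logprob = rng.normal(size=N)             # log q(trajectory) under the guiding distribution
returns = rng.normal(size=N)                      # total reward of each sampled trajectory

theta = np.zeros((da, ds))
for _ in range(100):                              # simple gradient ascent on the surrogate
    f = lambda th: guided_objective(th, states, actions, behavior_logprob, returns)
    theta += 0.05 * numerical_grad(f, theta)
print("surrogate objective:", guided_objective(theta, states, actions, behavior_logprob, returns))
```

In practice one would also mix in samples from earlier policy iterates and adapt the regularization weight, but the core idea, weighting guiding samples by the ratio of policy to behavior probability and penalizing a collapsing normalizer, is the same.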