Trust Region Policy Optimization

In this article, we describe a method for optimizing control policies with guaranteed monotonic improvement. By making several approximations to the theoretically justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
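As a sketch of the update the article develops (notation assumed here: $\pi_\theta$ is the parameterized policy, $A$ an advantage estimate under the old policy, and $\delta$ the trust-region radius, a hyperparameter), each TRPO iteration maximizes a surrogate objective subject to a KL-divergence constraint:

\[
\begin{aligned}
\max_{\theta}\quad & \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A(s,a)\right] \\
\text{subject to}\quad & \mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\!\left[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_{\theta}(\cdot \mid s)\big)\right] \le \delta
\end{aligned}
\]

The constraint bounds how far each update can move the policy distribution, which is what underlies the near-monotonic improvement observed in the experiments.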
