Trust Region Policy Optimization

In this article, we describe a method for optimizing control policies with guaranteed monotonic improvement. By making several approximations to the theoretically justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
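As a sketch of the update the article develops (notation assumed here: $\pi_\theta$ is the parameterized policy, $A$ an advantage estimate under the old policy, and $\delta$ the trust-region radius, a hyperparameter), each TRPO iteration maximizes a surrogate objective subject to a KL-divergence constraint:

\[
\begin{aligned}
\max_{\theta}\quad & \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A(s,a)\right] \\
\text{subject to}\quad & \mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\!\left[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_{\theta}(\cdot \mid s)\big)\right] \le \delta
\end{aligned}
\]

The constraint bounds how far each update can move the policy distribution, which is what underlies the near-monotonic improvement observed in the experiments.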
