Reinforcement learning of motor skills in high dimensions: A path integral approach

Reinforcement learning (RL) is one of the most general approaches to learning control. Applying it to complex motor systems, however, has so far been largely intractable due to the computational difficulties that RL encounters in high-dimensional continuous state-action spaces. In this paper, we derive a novel approach to RL for parameterized control policies based on the framework of stochastic optimal control with path integrals. While solidly grounded in optimal control theory and estimation theory, the resulting update equations are surprisingly simple and carry no danger of numerical instability, since they require neither matrix inversions nor gradient learning rates. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a robot dog illustrates the functionality of our algorithm in a real-world scenario. We believe that our new algorithm, Policy Improvement with Path Integrals (PI2), is currently one of the most efficient, numerically robust, and easy-to-implement algorithms for RL in robotics.
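
The simplicity claim can be made concrete. Below is a minimal Python sketch of the core PI2-style parameter update as described in the abstract: rollouts executed with exploration noise are weighted by a softmax over their costs-to-go, and the parameter update is the probability-weighted average of that noise. The function name `pi2_update`, the temperature constant `lam`, and the collapse of the full algorithm's per-time-step weighting into a single cost-to-go per rollout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pi2_update(theta, epsilons, costs, lam=1.0):
    """One PI2-style parameter update (illustrative sketch, not the paper's exact algorithm).

    theta    -- (D,)   current policy parameters
    epsilons -- (K, D) exploration noise; rollout k was run with parameters theta + epsilons[k]
    costs    -- (K,)   measured cost-to-go S(tau_k) of each of the K rollouts
    lam      -- assumed temperature-like constant scaling the exponentiation
    """
    # Exponentiate negative costs so that low-cost rollouts receive high weight.
    # Subtracting the minimum cost only improves numerical stability; it cancels
    # in the normalization below.
    shifted = costs - costs.min()
    weights = np.exp(-shifted / lam)
    probs = weights / weights.sum()   # a softmax: no matrix inversion anywhere
    # The update is the probability-weighted average of the exploration noise;
    # note the absence of a gradient learning rate.
    return theta + probs @ epsilons

# Usage with stand-in data:
rng = np.random.default_rng(0)
theta = np.zeros(10)
eps = 0.1 * rng.standard_normal((20, 10))   # 20 rollouts, 10 parameters
costs = rng.random(20)                      # placeholder for rollout costs-to-go
theta = pi2_update(theta, eps, costs, lam=0.5)
```

In the full PI2 algorithm the weighting is computed per time step along each trajectory and then averaged over time; the sketch above omits that bookkeeping but preserves the property the abstract emphasizes, namely that the update needs neither matrix inversions nor a learning rate.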
