Learning Policy Improvements with Path Integrals

With the goal of generating more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parametrized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model-free, depending on how the learning problem is structured. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition for why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. We believe that Policy Improvement with Path Integrals (PI^2) currently offers one of the most efficient, numerically robust, and easy-to-implement RL algorithms based on trajectory roll-outs.
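To make the kind of update the abstract refers to concrete, here is a minimal, hypothetical Python sketch of a single probability-weighted parameter update in the spirit of PI^2: roll-out costs are turned into probabilities by exponentiation, and the policy parameters are moved by the probability-weighted average of the exploration noise. The function name, the cost normalization, and the temperature lam are illustrative assumptions; the full algorithm additionally weights each time step and projects the noise through the policy's basis functions.

    import numpy as np

    def pi2_style_update(theta, epsilons, costs, lam=0.1):
        """Illustrative probability-weighted update (not the full PI^2 algorithm).

        theta    : (n,) current policy parameters
        epsilons : (K, n) exploration noise added to theta in K roll-outs
        costs    : (K,) accumulated trajectory cost of each roll-out
        lam      : temperature relating cost to probability
        """
        # Map costs to [0, 1] so the exponentiation is numerically well behaved.
        s = (costs - costs.min()) / (costs.max() - costs.min() + 1e-10)
        # Low-cost roll-outs receive high probability (softmax over negative cost).
        p = np.exp(-s / lam)
        p /= p.sum()
        # Parameter change: probability-weighted average of the exploration noise.
        return theta + p @ epsilons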
