Reinforcement Learning for Parameterized Motor Primitives

One of the major challenges both in action generation for robotics and in the understanding of human motor control is to learn the "building blocks of movement generation", called motor primitives. Motor primitives, as used in this paper, are parameterized control policies such as splines or nonlinear differential equations with desired attractor properties. While considerable progress has been made in teaching parameterized motor primitives using supervised or imitation learning, self-improvement through interaction with the environment remains a challenging problem. In this paper, we evaluate different reinforcement learning approaches for improving the performance of parameterized motor primitives. To this end, we highlight the difficulties with current reinforcement learning methods, and outline both established and novel algorithms for the gradient-based improvement of parameterized policies. We compare these algorithms in the context of motor primitive learning, and show that our most recent algorithm, the Episodic Natural Actor-Critic, outperforms previous algorithms by at least an order of magnitude. We demonstrate the efficiency of this reinforcement learning method by applying it to the task of learning to hit a baseball with an anthropomorphic robot arm.
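To make the gradient-based setting concrete, here is a minimal sketch of an episodic likelihood-ratio ("REINFORCE"-style) policy gradient with a mean-return baseline, applied to a toy motor primitive that is linear in its parameters. Everything in it (the Gaussian basis functions, the tracking cost, and names such as `features` and `rollout`) is an illustrative assumption rather than the paper's implementation; the Episodic Natural Actor-Critic discussed in the paper additionally premultiplies this gradient estimate by the inverse Fisher information matrix.

```python
# Hedged sketch: episodic likelihood-ratio policy gradient for a toy
# linear-in-parameters motor primitive. Task, basis functions, and all
# constants are assumptions for illustration, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

n_basis, n_steps = 5, 50
centers = np.linspace(0.0, 1.0, n_basis)           # basis-function centers (assumed)
width = 0.5 / n_basis                               # shared bandwidth (assumed)
theta = np.zeros(n_basis)                           # primitive parameters to learn
sigma = 0.1                                         # exploration noise std
target = np.sin(np.linspace(0.0, np.pi, n_steps))   # toy desired trajectory

def features(t):
    """Normalized Gaussian basis features at phase t in [0, 1]."""
    phi = np.exp(-0.5 * ((t - centers) / width) ** 2)
    return phi / phi.sum()

def rollout(theta):
    """One noisy episode: returns the episodic return and summed score function."""
    ret, score = 0.0, np.zeros_like(theta)
    for k, t in enumerate(np.linspace(0.0, 1.0, n_steps)):
        phi = features(t)
        u = phi @ theta + sigma * rng.standard_normal()  # Gaussian exploratory action
        ret -= (u - target[k]) ** 2                      # negative tracking cost as reward
        score += (u - phi @ theta) / sigma**2 * phi      # grad_theta log pi(u | t)
    return ret, score

alpha, n_episodes = 0.05, 20                             # step size, rollouts per update
for _ in range(200):
    data = [rollout(theta) for _ in range(n_episodes)]
    returns = np.array([r for r, _ in data])
    baseline = returns.mean()                            # variance-reducing baseline
    grad = sum((r - baseline) * s for r, s in data) / n_episodes
    theta += alpha * grad                                # "vanilla" gradient ascent step

print("final mean return:", returns.mean())
```

Subtracting the mean return as a baseline leaves the gradient estimate unbiased while reducing its variance, which is one of the practical difficulties with likelihood-ratio methods that the paper's comparison addresses; a natural-gradient variant would replace the plain ascent step with a Fisher-preconditioned one.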
