Policy Gradient Methods for Robotics

The acquisition and improvement of motor skills and control policies from trial and error is of essential importance if robots are ever to leave precisely pre-structured environments. To date, however, only a few reinforcement learning methods have been scaled to high-dimensional robots such as manipulators, legged robots, or humanoids. Policy gradient methods remain one of the few exceptions and have found a variety of applications. Nevertheless, applying such methods in an uninformed manner is not without peril. In this paper, we give an overview of learning with policy gradient methods for robotics, with a strong focus on recent advances in the field. We outline previous applications to robotics and show how the most recently developed methods can significantly improve learning performance. Finally, we evaluate our most promising algorithm in the application of hitting a baseball with an anthropomorphic arm.
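To make the core idea concrete, the following is a minimal sketch of the likelihood-ratio (REINFORCE-style) policy gradient estimator that underlies the methods surveyed here. The task, policy parameterization, and hyperparameters are all hypothetical choices for illustration: a one-step bandit with a Gaussian policy over a scalar action, where the reward is the negative squared distance to a hidden target, so the optimal policy mean equals the target. A running-average reward baseline is used for variance reduction.

```python
# Hedged sketch of a likelihood-ratio (REINFORCE-style) policy gradient,
# on a hypothetical toy problem (not the paper's robot experiments).
import random


def reinforce(target=2.0, theta=0.0, sigma=0.5, alpha=0.02,
              episodes=3000, seed=0):
    """One-step bandit: action a ~ N(theta, sigma^2), reward -(a - target)^2.

    Returns the learned policy mean theta, which should approach `target`.
    """
    rng = random.Random(seed)
    baseline = 0.0  # running-average reward baseline (variance reduction)
    for _ in range(episodes):
        a = rng.gauss(theta, sigma)        # sample action from the policy
        r = -(a - target) ** 2             # observe reward
        baseline += 0.05 * (r - baseline)  # track average reward
        # d/dtheta log N(a; theta, sigma^2) = (a - theta) / sigma^2
        grad_log_pi = (a - theta) / sigma ** 2
        # stochastic gradient ascent on expected reward
        theta += alpha * (r - baseline) * grad_log_pi
    return theta
```

In expectation the update equals the true gradient of expected reward, since the baseline adds no bias; in practice the estimator's variance is the central difficulty that the baseline (and, in the surveyed literature, natural-gradient variants) addresses.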
