Policy Gradients with Parameter-Based Exploration for Control

We present a model-free reinforcement learning method for partially observable Markov decision problems. Our method estimates a likelihood gradient by sampling directly in parameter space, which leads to lower-variance gradient estimates than those obtained by policy gradient methods such as REINFORCE. For several complex control tasks, including robust standing with a humanoid robot, we show that our method outperforms well-known algorithms from the fields of policy gradients, finite difference methods, and population-based heuristics. We also provide a detailed analysis of the differences between our method and the other algorithms.
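The idea of estimating a likelihood gradient by sampling in parameter space can be illustrated with a small sketch. This is not the authors' implementation: it assumes an independent Gaussian search distribution over policy parameters, a batch-mean baseline, and plain gradient ascent. The function `episode_return`, its quadratic objective, and all step sizes are hypothetical stand-ins for a real control rollout.

```python
import numpy as np

def episode_return(theta):
    # Hypothetical stand-in for one rollout of a deterministic controller
    # with parameters theta; a real task would run the plant or simulator.
    target = np.array([1.0, -2.0])
    return -np.sum((theta - target) ** 2)

def parameter_space_gradient_search(n_iters=300, batch=20,
                                    lr_mu=0.05, lr_sigma=0.01, seed=0):
    # Assumed search distribution: theta ~ N(mu, diag(sigma^2)).
    rng = np.random.default_rng(seed)
    mu = np.zeros(2)
    sigma = np.ones(2)
    for _ in range(n_iters):
        # Perturb the parameters once per episode, not the action at every step.
        thetas = mu + sigma * rng.standard_normal((batch, 2))
        returns = np.array([episode_return(t) for t in thetas])
        # Subtract the batch mean as a simple baseline to reduce variance.
        adv = returns - returns.mean()
        diff = thetas - mu
        # Likelihood-ratio gradients of the expected return:
        #   d log p / d mu    = (theta - mu) / sigma^2
        #   d log p / d sigma = ((theta - mu)^2 - sigma^2) / sigma^3
        grad_mu = (adv[:, None] * diff / sigma**2).mean(axis=0)
        grad_sigma = (adv[:, None] * (diff**2 - sigma**2) / sigma**3).mean(axis=0)
        mu = mu + lr_mu * grad_mu
        # Keep the exploration width strictly positive.
        sigma = np.maximum(sigma + lr_sigma * grad_sigma, 1e-3)
    return mu, sigma

if __name__ == "__main__":
    mu, sigma = parameter_space_gradient_search()
    print("mean parameters:", mu)
    print("exploration widths:", sigma)
```

Because each episode in this sketch uses a single parameter sample, the return depends on one draw of noise rather than on a fresh draw at every time step, which is the intuition behind the lower-variance claim above.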
