Exploring parameter space in reinforcement learning

This paper discusses parameter-based exploration methods for reinforcement learning. Parameter-based methods perturb parameters of a general function approximator directly, rather than adding noise to the resulting actions. Parameter-based exploration unifies reinforcement learning and black-box optimization, and has several advantages over action perturbation. We review two recent parameter-exploring algorithms: Natural Evolution Strategies and Policy Gradients with Parameter-Based Exploration. Both outperform state-of-the-art algorithms in several complex high-dimensional tasks commonly found in robot control. Furthermore, we describe how a novel exploration method, State-Dependent Exploration, can modify existing algorithms to mimic exploration in parameter space.
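The difference between perturbing actions and perturbing parameters can be illustrated with a minimal sketch. It assumes a two-parameter linear controller and Gaussian noise; the function names, the noise scale, and the policy itself are hypothetical choices made for illustration and are not taken from the paper or from the algorithms it reviews.

```python
import numpy as np

def linear_policy(theta, state):
    """Deterministic linear controller (illustrative): action = theta . state."""
    return float(theta @ state)

def act_with_action_noise(theta, state, sigma, rng):
    """Action-space exploration: fresh noise is added to every single action."""
    return linear_policy(theta, state) + sigma * rng.standard_normal()

def sample_perturbed_parameters(theta, sigma, rng):
    """Parameter-space exploration: perturb the parameters once, e.g. per episode."""
    return theta + sigma * rng.standard_normal(theta.shape)

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2])   # assumed policy parameters
state = np.array([1.0, 2.0])    # the same state visited three times

# Action perturbation: the same state yields a different action on each visit.
noisy_actions = [act_with_action_noise(theta, state, 0.1, rng) for _ in range(3)]

# Parameter perturbation: the perturbed policy is held fixed for the episode,
# so the same state always yields the same action.
theta_episode = sample_perturbed_parameters(theta, 0.1, rng)
param_actions = [linear_policy(theta_episode, state) for _ in range(3)]

print("action noise:   ", noisy_actions)
print("parameter noise:", param_actions)
```

In this sketch the parameter-perturbed controller stays deterministic within an episode, so variation in return across episodes can be attributed to the sampled parameters rather than to per-step noise.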
