A study in direct policy search

Reinforcement learning in partially observable environments is an important and challenging problem. Because many value-function-based methods have been shown to perform poorly in this setting, we study direct policy search methods instead. The aim of this work is to advance the state of the art in direct policy search and black-box optimization. Its contributions include four new algorithms: (1) an algorithm that backpropagates recurrent policy gradients through time, thereby learning a memory and a policy simultaneously; (2) an instantiation of the well-known Expectation-Maximization (EM) algorithm adapted to learning policies in partially observable environments; (3) Fitness Expectation-Maximization, a black-box search method derived from EM; and (4) Natural Evolution Strategies, an alternative to conventional evolutionary methods that uses a natural gradient to perform stochastic search. Experimental results with these four methods demonstrate competitive performance on a variety of test problems and benchmarks.
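
To give a rough feel for what "using a natural gradient to perform stochastic search" in (4) can look like, here is a minimal sketch in Python. It is not the algorithm developed in this work. It assumes a separable (diagonal) Gaussian search distribution, a simple linear rank-based weighting of samples, and hand-picked learning rates. The name `separable_nes` and all of its parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def separable_nes(fitness, mu, sigma, iterations=300, population=20,
                  lr_mu=1.0, lr_sigma=0.2, seed=0):
    """Maximise `fitness` by adapting a diagonal Gaussian N(mu, diag(sigma^2)).

    Illustrative sketch only: utilities and learning rates are assumptions.
    """
    rng = random.Random(seed)
    mu, sigma = list(mu), list(sigma)
    dim = len(mu)
    # Assumed rank-based utilities: linear in rank, zero-sum, best sample largest.
    utilities = [(0.5 - rank / (population - 1)) / population
                 for rank in range(population)]
    for _ in range(iterations):
        # Draw standard-normal perturbations and map them into search space.
        noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                 for _ in range(population)]
        samples = [[m + s * n for m, s, n in zip(mu, sigma, row)]
                   for row in noise]
        scores = [fitness(x) for x in samples]
        # Indices of samples sorted from best to worst fitness.
        order = sorted(range(population), key=lambda k: scores[k], reverse=True)
        for i in range(dim):
            # Utility-weighted natural-gradient estimates in the local
            # (standard-normal) coordinates of the search distribution.
            grad_mu = sum(u * noise[k][i] for u, k in zip(utilities, order))
            grad_sigma = sum(u * (noise[k][i] ** 2 - 1.0)
                             for u, k in zip(utilities, order))
            mu[i] += lr_mu * sigma[i] * grad_mu
            # Multiplicative update keeps every sigma strictly positive.
            sigma[i] *= math.exp(0.5 * lr_sigma * grad_sigma)
    return mu, sigma
```

For example, calling `separable_nes` with a fitness function that returns the negative squared distance to a fixed target vector should move `mu` toward that target while `sigma` shrinks. Replacing raw fitness values with rank-based utilities is a design choice made here to keep the update invariant to monotone rescaling of the fitness function.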
