Paper information: A study in direct policy search

A study in direct policy search

Reinforcement learning in partially observable environments is an important and challenging problem. Since many value function-based methods have been shown to perform poorly in such environments, we study direct policy search methods instead. The aim of this work is to advance the state of the art in direct policy search and black box optimization. Its contributions include four new algorithms: (1) a novel algorithm that backpropagates recurrent policy gradients through time, thereby learning both memory and a policy at the same time; (2) an instantiation of the well-known EM algorithm adapted to learning policies in partially observable environments; (3) Fitness Expectation-Maximization, a black box search method derived from EM; (4) Natural Evolution Strategies, an alternative to conventional evolutionary methods that uses a natural gradient to perform stochastic search. Experimental results with these four methods demonstrate competitive performance on a variety of test problems and benchmarks.
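To make item (4) of the abstract concrete, the following is a minimal sketch of stochastic search driven by a natural gradient. It is not the thesis's algorithm: it assumes a Gaussian search distribution with an independent standard deviation per dimension, rank-based utilities in place of raw fitness values, and hand-picked learning rates. The names `natural_gradient_search`, `negative_sphere`, `eta_mu` and `eta_sigma` are hypothetical and chosen only for this illustration.

```python
import math
import random


def natural_gradient_search(fitness, mu, sigma, iterations=300, population=20,
                            eta_mu=1.0, eta_sigma=0.2, seed=0):
    """Maximize `fitness` by adapting a per-dimension Gaussian search distribution.

    Illustrative sketch only; all parameter names and defaults are assumptions.
    """
    rng = random.Random(seed)
    mu, sigma = list(mu), list(sigma)
    dim = len(mu)
    for _ in range(iterations):
        # Draw standard-normal perturbations and evaluate the resulting candidates.
        noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(population)]
        scores = [fitness([m + s * e for m, s, e in zip(mu, sigma, eps)])
                  for eps in noise]
        # Rank-based utilities from -0.5 (worst) to 0.5 (best); they sum to zero.
        order = sorted(range(population), key=lambda k: scores[k])
        utility = [0.0] * population
        for rank, k in enumerate(order):
            utility[k] = rank / (population - 1) - 0.5
        for d in range(dim):
            # Monte Carlo estimates of the natural gradient for a 1-D Gaussian,
            # expressed in terms of the standardized perturbation eps[d].
            g_mu = sum(u * eps[d] for u, eps in zip(utility, noise)) / population
            g_sigma = sum(u * (eps[d] ** 2 - 1.0)
                          for u, eps in zip(utility, noise)) / population
            mu[d] += eta_mu * sigma[d] * g_mu
            # Multiplicative update keeps the standard deviation positive.
            sigma[d] *= math.exp(0.5 * eta_sigma * g_sigma)
    return mu, sigma


def negative_sphere(x):
    """Toy objective with its maximum at the origin."""
    return -sum(v * v for v in x)


mu, sigma = natural_gradient_search(negative_sphere, [3.0, -2.0], [1.0, 1.0])
print(mu, sigma)
```

On this toy objective the mean should drift toward the origin while the standard deviations shrink. Using ranks rather than raw fitness values makes the update invariant to monotone rescaling of the objective, which is one common design choice for this family of methods.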

Daan Wierstra
