Simple statistical gradient-following algorithms for connectionist reinforcement learning

This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
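The gradient-following idea can be illustrated with a minimal sketch, assuming a single Bernoulli-logistic unit and an update of the form Δw_j = α (r − b) ∂ln g/∂w_j, where g is the probability of the output the unit actually emitted. For a logistic unit that derivative reduces to (y − p) x_j. The toy task, the reward function, the running-average baseline, and all names and hyperparameters below are hypothetical choices made for illustration and are not taken from the abstract.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reward(x, y):
    # Hypothetical immediate-reinforcement task: reward 1 when the unit's
    # output matches the sign of the first input component, else 0.
    target = 1 if x[0] > 0 else 0
    return 1.0 if y == target else 0.0


def reinforce_step(w, x, alpha, baseline, rng):
    """One REINFORCE-style update for a single Bernoulli-logistic unit."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # P(y = 1 | x, w)
    y = 1 if rng.random() < p else 0                   # stochastic output
    r = reward(x, y)
    # For this unit, d ln g / d w_j = (y - p) * x_j, so the update uses
    # only the sampled output, its probability, and the scalar reward.
    new_w = [wi + alpha * (r - baseline) * (y - p) * xi
             for wi, xi in zip(w, x)]
    return new_w, r


def train(steps=5000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]      # one input weight plus a bias weight
    baseline = 0.0      # assumed baseline: running average of past rewards
    for _ in range(steps):
        x = [rng.choice([-1.0, 1.0]), 1.0]  # second component is the bias input
        w, r = reinforce_step(w, x, alpha, baseline, rng)
        baseline += 0.05 * (r - baseline)
    return w


if __name__ == "__main__":
    w = train()
    print("P(y=1 | x=+1):", sigmoid(w[0] + w[1]))
    print("P(y=1 | x=-1):", sigmoid(-w[0] + w[1]))
```

Each step touches only quantities local to the unit plus the scalar reward; no gradient estimate is accumulated or stored between steps, which is the property the abstract emphasizes.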
