Making the World Differentiable: On Using Self-Supervised Fully Recurrent Neural Networks for Dynamic Reinforcement Learning

First a brief introduction to reinforcement learning and to supervised learning with recurrent networks in non-stationary environments is given. The introduction also covers the basic principle of `gradient descent through frozen model networks' as employed by Werbos, Jordan, Munro, Robinson and Fallside, and Nguyen and Widrow. This principle allows supervised learning techniques to be employed for reinforcement learning. Then a general algorithm for a reinforcement learning neural network with internal and external feedback in a non-stationary reactive environment is described. Internal feedback is given by connections that allow cyclic activation flow through the network. External feedback is given by output actions that may change the state of the environment, thus influencing subsequent input activations. The network's main goal is to receive as much reinforcement (or as little `pain') as possible. In theory, arbitrary time lags between actions and ulterior consequences are possible. The `visible' environment may be `non-Markovian'. Although the approach is based on `supervised' learning algorithms for fully recurrent dynamic networks, no teacher is required. An adaptive model of the environmental dynamics is constructed which includes a model of future reinforcement to be received. This model is used for learning goal-directed behavior. The algorithm may concurrently learn the model and learn to pursue the main goal. To attack certain problems with the parallel version of the algorithm, `adaptive randomness' is introduced. The algorithm is applied to a reinforcement learning task in a non-Markovian environment. A connection to `meta-learning' (learning how to learn) is noted. An extension of the algorithm is described which includes a vector-valued adaptive critic element (based on Sutton's TD-methods). The possibility of using the model for learning by planning (by `mental simulation' of the environmental dynamics) is investigated. Finally it is described how the algorithm can be augmented by dynamic curiosity and boredom. This can be done by introducing (delayed) reinforcement for controller actions that increase the model network's knowledge about the world. This in turn requires the model network to model its own ignorance, thus showing a rudimentary form of introspective behavior.

Figure 1: Internal and external feedback.
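To make the central principle concrete, here is a minimal sketch of `gradient descent through frozen model networks' in JAX. It is an illustration under simplifying assumptions, not the paper's implementation: it uses small feedforward tanh networks where the paper uses fully recurrent ones trained online, and the layer sizes, function names, and plain SGD are illustrative choices. A model network learns to predict the next input and the reinforcement from the current input and the controller's action; the controller is then improved by propagating gradients of the predicted reinforcement through the model's weights while holding those weights fixed.

```python
# A minimal sketch of `gradient descent through a frozen model network'.
# Assumptions (not from the paper): feedforward tanh networks instead of the
# paper's fully recurrent ones, illustrative layer sizes, and plain SGD.
import jax
import jax.numpy as jnp

def mlp(params, x):
    # Simple multi-layer perceptron with tanh hidden units.
    for W, b in params[:-1]:
        x = jnp.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

def init(sizes, key):
    # Random initialization of one weight matrix and bias per layer.
    params = []
    for m, n in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((0.1 * jax.random.normal(sub, (m, n)), jnp.zeros(n)))
    return params, key

key = jax.random.PRNGKey(0)
model, key = init([4 + 2, 16, 4 + 1], key)  # (state, action) -> (next state, reward)
ctrl, key = init([4, 16, 2], key)           # state -> action

def model_loss(model, s, a, s_next, r):
    # The model network learns to predict the environment's reaction,
    # including the reinforcement to be received.
    pred = mlp(model, jnp.concatenate([s, a]))
    return jnp.sum((pred[:-1] - s_next) ** 2) + (pred[-1] - r) ** 2

def ctrl_loss(ctrl, model, s):
    # The controller is trained to maximize the reinforcement predicted by
    # the model; the model's weights are held fixed (`frozen') here, but
    # gradients flow through them back into the controller.
    a = jnp.tanh(mlp(ctrl, s))
    return -mlp(model, jnp.concatenate([s, a]))[-1]

model_grad = jax.grad(model_loss)  # gradient w.r.t. model weights
ctrl_grad = jax.grad(ctrl_loss)    # gradient w.r.t. controller weights only

def sgd(params, grads, lr=0.01):
    return [(W - lr * dW, b - lr * db) for (W, b), (dW, db) in zip(params, grads)]

# One illustrative update with dummy data:
s, a, s_next, r = jnp.zeros(4), jnp.zeros(2), jnp.ones(4), 1.0
model = sgd(model, model_grad(model, s, a, s_next, r))  # improve the world model
ctrl = sgd(ctrl, ctrl_grad(ctrl, model, s))             # improve the controller
```

The two updates can be interleaved at every time step, mirroring the abstract's statement that the algorithm may concurrently learn the model and learn to pursue the main goal.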
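The curiosity and boredom extension can be sketched in the same setting. The abstract rewards controller actions that increase the model network's knowledge about the world; a simple assumed stand-in for that signal is the model's current prediction error, reusing model_loss from the sketch above. The paper's actual formulation is more refined (the model must model its own ignorance), so this is only a first approximation.

```python
# Hypothetical curiosity bonus: extra reinforcement proportional to how badly
# the current model predicts the observed transition, so the controller is
# paid for experiences the model has not yet learned. The scale factor is an
# illustrative assumption.
def curiosity_reward(model, s, a, s_next, r, scale=0.1):
    return scale * model_loss(model, s, a, s_next, r)
```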

[1] M. Gherrity et al., A learning algorithm for analog, fully recurrent neural networks, 1989, International 1989 Joint Conference on Neural Networks.

[2] Teuvo Kohonen et al., Self-Organization and Associative Memory, 1988.

[3] Geoffrey E. Hinton et al., Learning internal representations by error propagation, 1986.

[4] Charles W. Anderson et al., Learning and problem-solving with multilayer connectionist systems, 1986.

[5] H. Franke et al., Ästhetik als Informationsverarbeitung (Aesthetics as Information Processing), 1974.

[6] Richard S. Sutton et al., Temporal credit assignment in reinforcement learning, 1984.

[7] Paul J. Werbos et al., Generalization of backpropagation with application to a recurrent gas market model, 1988, Neural Networks.

[8] Jürgen Schmidhuber et al., A possibility for implementing curiosity and boredom in model-building neural controllers, 1991.

[9] Barak A. Pearlmutter, Learning State Space Trajectories in Recurrent Neural Networks, 1989, Neural Computation.

[10] Paul J. Werbos et al., Building and Understanding Adaptive Systems: A Statistical/Numerical Approach to Factory Automation and Brain Research, 1987, IEEE Transactions on Systems, Man, and Cybernetics.

[11] B. Widrow et al., The truck backer-upper: an example of self-learning in neural networks, 1989, International 1989 Joint Conference on Neural Networks.

[12] Frank Fallside et al., Dynamic reinforcement driven error propagation networks with application to game playing, 1989.

[13] Ronald J. Williams et al., Experimental Analysis of the Real-time Recurrent Learning Algorithm, 1989.

[14] P. Werbos et al., Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, 1974.

[15] Jeffrey L. Elman et al., Finding Structure in Time, 1990, Cognitive Science.

[16] Jürgen Schmidhuber et al., A Local Learning Algorithm for Dynamic Feedforward and Recurrent Networks, 1989.

[17] Jürgen Schmidhuber et al., Reinforcement Learning with Interacting Continually Running Fully Recurrent Networks, 1990.

[18] P. J. Werbos et al., Backpropagation and neurocontrol: a review and prospectus, 1989, International 1989 Joint Conference on Neural Networks.

[19] P. Werbos et al., Expectation Driven Learning with an Associative Memory, 1990.

[20] Richard S. Sutton et al., Neuronlike adaptive elements that can solve difficult learning control problems, 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[21] Jürgen Schmidhuber et al., Recurrent networks adjusted by adaptive critics, 1990.

[22] Jürgen Schmidhuber et al., Networks adjusting networks, 1990.

[23] Michael I. Jordan et al., Learning to Control an Unstable System with Forward Modeling, 1989, NIPS.

[24] T. Sejnowski et al., Learning Algorithms for Networks with Internal and External Feedback, 1990.

[25] Michael I. Jordan, Supervised learning and systems with excess degrees of freedom, 1988.

[26] R. J. Williams et al., On the use of backpropagation in associative reinforcement learning, 1988, IEEE 1988 International Conference on Neural Networks.

[27] Arthur L. Samuel et al., Some Studies in Machine Learning Using the Game of Checkers, 1967, IBM Journal of Research and Development.

[28] Jürgen Schmidhuber et al., An on-line algorithm for dynamic reinforcement learning and planning in reactive environments, 1990, 1990 IJCNN International Joint Conference on Neural Networks.

[29] S. Piche et al., First-Order Gradient Descent Training of Adaptive Discrete-Time Dynamic Networks, 1991.