Learning Agent State Online with Recurrent Generate-and-Test

Learning continually and online from a continuous stream of data is challenging, especially for a reinforcement learning agent operating on sequential data. When the environment provides only observations that give partial information about its state, the agent must construct its own state from the stream of experience; we refer to the state learned directly from this stream as the agent state. Recurrent neural networks can learn the agent state, but their training methods are computationally expensive and sensitive to hyperparameters, making them ill-suited to online learning. This work introduces methods based on the generate-and-test approach for learning the agent state. A generate-and-test algorithm searches for state features by generating candidate features and testing their usefulness: features that contribute to the agent's performance on the task are preserved, and the least useful features are replaced with newly generated ones. We study the effectiveness of our methods on two online multistep prediction problems. The first, trace conditioning, tests the agent's ability to remember a cue in order to make a prediction multiple steps into the future. In the second, trace patterning, the agent must learn patterns in the observation signals and remember them for future predictions. We show that our proposed methods can effectively learn the agent state online and produce accurate predictions.
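The generate-and-test loop described above can be made concrete with a minimal, non-recurrent sketch. The code below assumes random linear-threshold features, an online linear predictor, and outgoing-weight magnitude as the usefulness measure; the class name and parameters (e.g. `replace_fraction`, `replace_period`) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

class GenerateAndTestFeatures:
    """Toy generate-and-test feature search for online linear prediction.

    Sketch only: random threshold features over the current observation,
    usefulness measured by outgoing-weight magnitude (an assumption),
    weakest features periodically replaced by freshly generated ones.
    """

    def __init__(self, n_inputs, n_features=100, step_size=0.01,
                 replace_fraction=0.02, replace_period=500):
        # Generator: each feature is a random linear-threshold unit.
        self.W = np.random.randn(n_features, n_inputs)
        self.v = np.zeros(n_features)          # outgoing (prediction) weights
        self.step_size = step_size
        self.n_replace = max(1, int(replace_fraction * n_features))
        self.replace_period = replace_period
        self.t = 0

    def features(self, x):
        # Binary threshold features of the current observation vector x.
        return (self.W @ x > 0.0).astype(float)

    def update(self, x, target):
        phi = self.features(x)
        error = target - self.v @ phi
        self.v += self.step_size * error * phi   # learn the predictor online
        self.t += 1
        if self.t % self.replace_period == 0:
            self._test_and_replace()             # tester + generator step
        return error

    def _test_and_replace(self):
        # Tester: usefulness approximated by outgoing-weight magnitude.
        utility = np.abs(self.v)
        weakest = np.argsort(utility)[:self.n_replace]
        # Generator: replace the least useful features with new random ones.
        self.W[weakest] = np.random.randn(self.n_replace, self.W.shape[1])
        self.v[weakest] = 0.0
```

Note that this sketch computes features from the current observation only, so it illustrates the search loop rather than the recurrent agent state; a state-constructing variant would also feed previous feature values back in as inputs.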
