Integrating Episodic Memory into a Reinforcement Learning Agent using Reservoir Sampling

Episodic memory is a term from psychology that refers to the ability to recall specific events from one's past. We suggest that one advantage of this particular type of memory is the ability to easily assign credit to a specific state when remembered information is found to be useful. Inspired by this idea, and by the increasing popularity of external memory mechanisms for handling long-term dependencies in deep learning systems, we propose a novel algorithm that uses a reservoir sampling procedure to maintain an external memory consisting of a fixed number of past states. The algorithm allows a deep reinforcement learning agent to learn online which states are worth remembering, preferentially retaining those found useful to recall later. Critically, this method allows for efficient online computation of gradient estimates with respect to the memory's write process. Thus, unlike most prior external memory mechanisms, it is feasible to use in an online reinforcement learning setting.
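To make the write-side mechanics concrete, below is a minimal sketch of weighted reservoir sampling in the style of the Efraimidis–Spirakis A-Res scheme, which maintains a fixed-capacity memory over a stream of states in a single online pass. This is an illustration of the underlying sampling procedure, not the paper's exact algorithm; the class name `WeightedReservoir`, the scalar `weight` argument, and the heap-based implementation are assumptions made for the example.

```python
import heapq
import random


class WeightedReservoir:
    """Fixed-capacity memory of past states, filled in one online pass.

    Each incoming state with positive write weight w is assigned the key
    u ** (1 / w) with u ~ Uniform(0, 1), and the `capacity` states with
    the largest keys are retained. States with larger write weights are
    therefore more likely to survive in memory.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []   # min-heap of (key, tiebreak, state)
        self._count = 0   # tiebreak so states themselves are never compared

    def write(self, state, weight):
        # `weight` plays the role of a learned write strength in the
        # paper's setting; here it is just any positive scalar.
        key = random.random() ** (1.0 / weight)
        entry = (key, self._count, state)
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif key > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict the smallest key

    def contents(self):
        return [state for _, _, state in self._heap]


if __name__ == "__main__":
    memory = WeightedReservoir(capacity=5)
    for t in range(1000):
        # Toy write weights that favour later states.
        memory.write(state=t, weight=1.0 + t / 100.0)
    print(memory.contents())
```

In the agent described above, the write weight would presumably be produced by a learned network conditioned on the current state; because each state's probability of remaining in memory is a smooth function of its weight, this is the kind of write process for which online gradient estimates can be computed.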
