Improving Generalization in Meta Reinforcement Learning using Learned Objectives

Biological evolution has distilled the experiences of many learners into the general learning algorithms of humans. Our novel meta-reinforcement learning algorithm MetaGenRL is inspired by this process. MetaGenRL distills the experiences of many complex agents to meta-learn a low-complexity neural objective function that decides how future individuals will learn. Unlike recent meta-RL algorithms, MetaGenRL can generalize to new environments that are entirely different from those used for meta-training. In some cases, it even outperforms human-engineered RL algorithms. MetaGenRL uses off-policy second-order gradients during meta-training, which greatly increases its sample efficiency.
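
The core mechanism the abstract describes, meta-learning a neural objective function and differentiating through the policy update it induces, can be sketched compactly. Below is a minimal, hypothetical JAX illustration, not the paper's actual code: `objective_fn`, `inner_update`, and `meta_loss` are invented names, the objective network is a tiny MLP over per-timestep features (the paper uses a recurrent network), the policy is a toy linear-Gaussian, and `critic_q` is a stand-in critic supplying the off-policy learning signal.

```python
# Minimal sketch of MetaGenRL-style meta-gradients (illustrative only).
import jax
import jax.numpy as jnp

def objective_fn(phi, log_probs, rewards):
    # Learned neural objective L_phi: a tiny MLP over per-timestep
    # features (log pi(a|s), reward). Assumed architecture, not the paper's.
    feats = jnp.stack([log_probs, rewards], axis=-1)
    h = jnp.tanh(feats @ phi["w1"] + phi["b1"])
    return jnp.sum(h @ phi["w2"] + phi["b2"])

def policy_log_prob(theta, states, actions):
    # Toy Gaussian policy: mean = states @ theta, unit variance.
    mean = states @ theta
    return -0.5 * jnp.sum((actions - mean) ** 2, axis=-1)

def inner_update(theta, phi, batch, alpha=1e-3):
    # The agent's update: one gradient step on the *learned* objective.
    def loss(theta):
        lp = policy_log_prob(theta, batch["states"], batch["actions"])
        return objective_fn(phi, lp, batch["rewards"])
    return theta - alpha * jax.grad(loss)(theta)

def meta_loss(phi, theta, batch, critic_q):
    # Meta-objective: after updating theta with L_phi, a critic should
    # score the new policy's actions highly. Differentiating this through
    # inner_update w.r.t. phi yields the second-order gradients the
    # abstract refers to; since the batch can come from a replay buffer,
    # the meta-gradient is off-policy.
    theta_new = inner_update(theta, phi, batch)
    new_actions = batch["states"] @ theta_new
    return -jnp.mean(critic_q(batch["states"], new_actions))

meta_grad = jax.grad(meta_loss)  # gradient of the meta-loss w.r.t. phi

# Example usage with random toy data and a stand-in critic:
key = jax.random.PRNGKey(0)
phi = {"w1": jnp.zeros((2, 8)), "b1": jnp.zeros(8),
       "w2": jnp.zeros((8, 1)), "b2": jnp.zeros(1)}
theta = jnp.zeros((4, 2))
batch = {"states": jax.random.normal(key, (32, 4)),
         "actions": jax.random.normal(key, (32, 2)),
         "rewards": jax.random.normal(key, (32,))}
critic_q = lambda s, a: jnp.sum(s, -1) + jnp.sum(a, -1)
print(jax.tree_util.tree_map(jnp.shape, meta_grad(phi, theta, batch, critic_q)))
```

Because the objective network's parameters phi are shared across agents and environments while each agent keeps its own theta, the learned objective, rather than any single policy, is what transfers to new environments.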
