Continual Backprop: Stochastic Gradient Descent with Persistent Randomness

The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and, second, initialization with small random weights, where the latter is essential to the effectiveness of the former. We show that in continual learning setups, Backprop performs well initially, but over time its performance degrades. Stochastic gradient descent alone is insufficient to learn continually; the initial randomness enables only initial learning, not continual learning. To the best of our knowledge, ours is the first result showing this degradation in Backprop’s ability to learn. To address this issue, we propose the Continual Backprop algorithm, which continually injects random features alongside gradient descent using a new generate-and-test process. We show that, unlike Backprop, Continual Backprop is able to continually adapt in both supervised and reinforcement learning problems. We expect that as continual learning becomes more common in future applications, a method like Continual Backprop will be essential to keep the advantages of random initialization present throughout learning.

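The core mechanism described above, running gradient descent while a generate-and-test process continually replaces low-utility features with fresh random ones, can be illustrated in a few lines. The following is a minimal sketch for a one-hidden-layer regression network; the particular utility measure (a running average of |outgoing weight × activation|), the replacement rate rho, and the maturity threshold are illustrative assumptions on our part, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden = 10, 50
W1 = rng.normal(0, 1 / np.sqrt(n_in), (n_hidden, n_in))   # small random init
b1 = np.zeros(n_hidden)
w2 = rng.normal(0, 1 / np.sqrt(n_hidden), n_hidden)

lr = 0.01          # SGD step size
rho = 0.001        # fraction of units replaced per step (assumed value)
maturity = 100     # steps before a unit may be replaced (assumed value)
age = np.zeros(n_hidden)
utility = np.zeros(n_hidden)   # running estimate of each unit's contribution

def step(x, y):
    """One SGD update followed by a generate-and-test replacement."""
    global W1, b1, w2, age, utility

    # Forward pass with ReLU hidden units.
    h = np.maximum(0.0, W1 @ x + b1)
    y_hat = w2 @ h
    err = y_hat - y

    # Backprop / stochastic gradient descent on squared error.
    grad_w2 = err * h
    grad_h = err * w2 * (h > 0)
    W1 -= lr * np.outer(grad_h, x)
    b1 -= lr * grad_h
    w2 -= lr * grad_w2

    # Assumed utility: running average of each unit's output contribution.
    utility = 0.99 * utility + 0.01 * np.abs(w2 * h)
    age += 1

    # Generate-and-test: reinitialize the lowest-utility mature units,
    # restoring the benefits of small random weights throughout learning.
    # With rho * n_hidden < 1, replace one unit stochastically.
    n_replace = int(rho * n_hidden) or (1 if rng.random() < rho * n_hidden else 0)
    eligible = np.where(age > maturity)[0]
    if n_replace and eligible.size:
        worst = eligible[np.argsort(utility[eligible])[:n_replace]]
        W1[worst] = rng.normal(0, 1 / np.sqrt(n_in), (worst.size, n_in))
        b1[worst] = 0.0
        w2[worst] = 0.0   # new unit starts with no effect on the output
        age[worst] = 0
        utility[worst] = 0.0
    return err ** 2
```

One design choice worth noting in this sketch: a replaced unit's outgoing weight is set to zero, so replacement never perturbs the network's current output; the fresh random feature gains influence only as gradient descent finds it useful.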