Scaling characteristics of sequential multitask learning: Networks naturally learn to learn

We explore the behavior of a standard convolutional neural network in a setting that introduces classification tasks sequentially and requires the network to master new tasks while preserving mastery of previously learned tasks. This setting mirrors the one human learners face as they acquire domain expertise, for example, as an individual reads a textbook chapter by chapter. Through simulations involving sequences of 10 related tasks, we find reason for optimism that networks will scale well as they advance from having a single skill to becoming domain experts. We observed two key phenomena. First, forward facilitation, the accelerated learning of task n+1 after having learned n previous tasks, grows with n. Second, backward interference, the forgetting of the n previous tasks upon learning task n+1, diminishes with n. Forward facilitation is the goal of research on metalearning, and reduced backward interference is the goal of research on ameliorating catastrophic forgetting. We find that both goals are attained simply through broader exposure to a domain.

Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country. Correspondence to: Anonymous Author. Preliminary work. Under review by the ICML 2019 Workshop "Identifying and Understanding Deep Learning Phenomena". Do not distribute.

In a standard supervised learning setting, neural networks are trained to perform a single task, such as classification, defined in terms of a discriminative distribution p(y | x, D) for labels y conditioned on input x given a data set D. Although such models are useful in engineering applications, they do not reflect the breadth of human intelligence, which depends on the capability to perform arbitrary tasks in a context-dependent manner. Multitask learning (Caruana, 1997) is concerned with performing any one of n tasks, usually by placing multiple heads on a neural network to produce outputs appropriate for each task, cast formally in terms of the distribution p(yi | x, D1, ..., Dn), where the subscript denotes a task index and i ∈ {1, ..., n} is an arbitrary task. When tasks are related, they can provide a useful inductive bias that extracts shared structure (Caruana, 1993) and act as a regularizer that guides the network toward solutions helpful on a variety of problems (Ruder, 2017).

Multitask learning is typically framed in terms of simultaneous training on all tasks, but humans and artificial agents operating in naturalistic settings more typically tackle tasks sequentially and must maintain mastery of previously learned tasks as they acquire a new one. Consider students reading a calculus text in which each chapter presents a different method. Early on, engaging with a chapter and its associated exercises leads to forgetting of material they had previously mastered. As more knowledge is acquired, however, students learn to scaffold their knowledge effectively and eventually leverage prior experience to integrate the new material with the old. By the final chapters, students have built a strong conceptual framework that allows new material to be integrated with little disruption of the old.

In this article, we study the machine-learning analog of our hypothetical students. The punch line of the article is that a generic neural network trained sequentially to acquire and maintain mastery of multiple tasks behaves similarly to human learners, exhibiting faster acquisition of new knowledge and less disruption of previously acquired knowledge as domain experience grows.

1. Sequential multitask learning

Early research investigating sequential training observed catastrophic forgetting (McCloskey & Cohen, 1989), characterized by a dramatic drop in task 1 performance following training on task 2; that is, the accuracy of the model p(y1 | x, D1 → D2) is significantly lower than the accuracy of the model p(y1 | x, D1), where the arrow denotes the training sequence. Parisi et al. (2019) review efforts to quantify and reduce catastrophic forgetting, including specialized mechanisms that aim to facilitate sequential learning. A second line of research exploring sequential training is the active topic of metalearning, or learning to learn.
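The sequential protocol and the backward-interference measure above can be sketched in code. The following toy simulation is our own illustration, not the paper's actual architecture or data: a shared linear map with one logistic head per task stands in for the convolutional network, tasks arrive one at a time, and every previously learned task is re-evaluated after each new task is trained. All names, dimensions, and the training scheme are illustrative assumptions, and the qualitative trends of this linear toy need not match the paper's CNN results.

```python
import numpy as np

# Toy sketch of the sequential multitask protocol: a shared linear map S
# feeds one logistic head per task. Training a new task updates S, which
# can disturb earlier tasks (backward interference). Illustrative setup,
# not the paper's CNN experiment.

rng = np.random.default_rng(0)
D, K, N = 10, 10, 400  # input dim, shared-feature dim, examples per task

def make_task():
    # Each task is a random linearly separable classification problem.
    w_true = rng.normal(size=D)
    X = rng.normal(size=(N, D))
    y = (X @ w_true > 0).astype(float)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_task(S, head, X, y, lr=0.5, epochs=300):
    # Alternating gradient steps on the logistic loss: update the task
    # head, then the shared map.
    for _ in range(epochs):
        h = X @ S                                 # shared features
        g = (sigmoid(h @ head) - y) / len(y)      # dLoss/dlogits
        head -= lr * (h.T @ g)
        S -= lr * np.outer(X.T @ g, head)         # chain rule through S
    return S, head

def acc(S, head, X, y):
    return float(np.mean((sigmoid(X @ S @ head) > 0.5) == y))

tasks = [make_task() for _ in range(4)]
S = rng.normal(size=(D, K)) / np.sqrt(D)
heads, history = [], []

for t, (X, y) in enumerate(tasks):
    S, head = train_task(S, np.zeros(K), X, y)
    heads.append(head)
    # Re-evaluate every task seen so far: drops in earlier entries
    # measure backward interference after learning task t.
    history.append([acc(S, heads[i], *tasks[i]) for i in range(t + 1)])

for t, row in enumerate(history):
    print(f"after task {t + 1}:", " ".join(f"{a:.2f}" for a in row))
```

Forward facilitation can be measured analogously in the same loop, e.g. by recording how many epochs each new task needs to reach a fixed accuracy criterion and checking whether that count shrinks as tasks accumulate.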

[1] R. A. Bailey. Design of Comparative Experiments. 2008.

[2] M. Andrychowicz et al. Learning to learn by gradient descent by gradient descent. NIPS, 2016.

[3] C. N. Wahlheim et al. Memory consequences of looking back to notice change: Retroactive and proactive facilitation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 2015.

[4] A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. 1993.

[5] N. Mishra et al. A Simple Neural Attentive Meta-Learner. ICLR, 2018.

[6] L. Postman. The Present Status of Interference Theory. 1961.

[7] M. McCloskey and N. J. Cohen. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. 1989.

[8] C. Osgood et al. An investigation into the causes of retroactive interference. Journal of Experimental Psychology, 1948.

[9] D. Lopez-Paz and M. Ranzato. Gradient Episodic Memory for Continual Learning. NIPS, 2017.

[10] G. I. Parisi et al. Continual Lifelong Learning with Neural Networks: A Review. Neural Networks, 2019.

[11] D. Ausubel et al. Retroactive inhibition and facilitation in the learning of school materials. 1957.

[12] S. Thrun. Is Learning the n-th Thing Any Easier Than Learning the First? NIPS, 1995.

[13] B. Lake and M. Baroni. Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks. ICML, 2018.

[14] R. Caruana. Multitask Learning: A Knowledge-Based Source of Inductive Bias. ICML, 1993.

[15] J. Johnson et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. CVPR, 2017.

[16] J. Loula et al. Rearranging the Familiar: Testing Compositional Generalization in Recurrent Networks. BlackboxNLP@EMNLP, 2018.

[17] J. X. Wang et al. Learning to reinforcement learn. CogSci, 2016.

[18] Y. Bengio et al. Learning a synaptic learning rule. IJCNN, 1991.

[19] A. Santoro et al. A simple neural network module for relational reasoning. NIPS, 2017.

[20] J. A. Fodor and Z. W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 1988.

[21] Y. Donner and J. Hardy. Piecewise power laws in individual learning curves. Psychonomic Bulletin & Review, 2015.

[22] C. Finn, P. Abbeel, and S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML, 2017.

[23] R. Caruana. Multitask Learning. Machine Learning, 1997.