Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks

Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next word prediction and image classification benefit from nuanced context sensitivity across high-dimensional inputs. However, there is an ongoing debate about how and when transformers can acquire highly structured behavior and achieve systematic generalization. Here, we explore how well a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions of these operations. We demonstrate strong generalization to sequences longer than those used in training by replacing the standard positional encoding typically used in transformers with labels arbitrarily paired with items in the sequence. We search for the layer and head configuration sufficient to solve these tasks, then probe for signs of systematic processing in latent representations and attention patterns. We show that two-layer transformers learn reliable solutions to multi-level problems, develop signs of task decomposition, and encode input items in a way that encourages the exploitation of shared computation across related tasks. These results provide key insights into how attention layers support structured computation both within a task and across multiple tasks.
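The label-based positional scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and class names, the label range `max_label`, and the additive combination of token and label embeddings are all assumptions. The core idea is that each training sequence is paired with sorted labels sampled at random from a range larger than the training length, so a model that learns to rely on relative label order can be queried on longer sequences at test time.

```python
import torch

def random_label_positions(seq_len, max_label=50, batch=1):
    """Sample sorted random labels to stand in for positions 0..seq_len-1.

    Labels are drawn without replacement from [0, max_label) and sorted,
    so only their relative order carries positional information. Because
    max_label exceeds the training sequence length, longer test sequences
    still receive labels from the same range.
    """
    # randperm draws without replacement; sorting preserves order information
    labels = torch.stack(
        [torch.randperm(max_label)[:seq_len].sort().values for _ in range(batch)]
    )
    return labels  # shape (batch, seq_len)

class LabeledEmbedding(torch.nn.Module):
    """Token embedding plus an embedding of the paired random label.

    A hypothetical input layer: the label embedding replaces the usual
    fixed or learned positional encoding.
    """
    def __init__(self, vocab_size, max_label, d_model):
        super().__init__()
        self.tok = torch.nn.Embedding(vocab_size, d_model)
        self.lab = torch.nn.Embedding(max_label, d_model)

    def forward(self, tokens, labels):
        # Sum token and label embeddings, as standard positional
        # encodings are typically added to token embeddings.
        return self.tok(tokens) + self.lab(labels)
```

Because the labels are resampled on every example, the model cannot memorize absolute positions and is pushed toward order-based computation, which is one plausible route to the length generalization the abstract reports.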
