Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks

Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next word prediction and image classification benefit from nuanced context sensitivity across high-dimensional inputs. However, there is an ongoing debate about how and when transformers can acquire highly structured behavior and achieve systematic generalization. Here, we explore how well a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions of these operations. We demonstrate strong generalization to sequences longer than those used in training by replacing the standard positional encoding typically used in transformers with labels arbitrarily paired with items in the sequence. We search for the layer and head configuration sufficient to solve these tasks, then probe for signs of systematic processing in latent representations and attention patterns. We show that two-layer transformers learn reliable solutions to multi-level problems, develop signs of task decomposition, and encode input items in a way that encourages the exploitation of shared computation across related tasks. These results provide key insights into how attention layers support structured computation both within a task and across multiple tasks.
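The label-based positional scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and class names, the label range `max_label`, and the additive combination of token and label embeddings are all assumptions. The core idea is that each training sequence is paired with sorted labels sampled at random from a range larger than the training length, so a model that learns to rely on relative label order can be queried on longer sequences at test time.

```python
import torch

def random_label_positions(seq_len, max_label=50, batch=1):
    """Sample sorted random labels to stand in for positions 0..seq_len-1.

    Labels are drawn without replacement from [0, max_label) and sorted,
    so only their relative order carries positional information. Because
    max_label exceeds the training sequence length, longer test sequences
    still receive labels from the same range.
    """
    # randperm draws without replacement; sorting preserves order information
    labels = torch.stack(
        [torch.randperm(max_label)[:seq_len].sort().values for _ in range(batch)]
    )
    return labels  # shape (batch, seq_len)

class LabeledEmbedding(torch.nn.Module):
    """Token embedding plus an embedding of the paired random label.

    A hypothetical input layer: the label embedding replaces the usual
    fixed or learned positional encoding.
    """
    def __init__(self, vocab_size, max_label, d_model):
        super().__init__()
        self.tok = torch.nn.Embedding(vocab_size, d_model)
        self.lab = torch.nn.Embedding(max_label, d_model)

    def forward(self, tokens, labels):
        # Sum token and label embeddings, as standard positional
        # encodings are typically added to token embeddings.
        return self.tok(tokens) + self.lab(labels)
```

Because the labels are resampled on every example, the model cannot memorize absolute positions and is pushed toward order-based computation, which is one plausible route to the length generalization the abstract reports.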
