The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization

Despite progress across a broad range of applications, Transformers have limited success in systematic generalization. The situation is especially frustrating in the case of algorithmic tasks, where they often fail to find intuitive solutions that route relevant information to the right node/operation at the right time in the grid represented by Transformer columns. To facilitate the learning of useful control flow, we propose two modifications to the Transformer architecture: a copy gate and geometric attention. Our novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the classic compositional table lookup task, as well as near-perfect accuracy on the simple arithmetic task and a new variant of ListOps that tests generalization across computational depths. NDR's attention and gating patterns tend to be interpretable as an intuitive form of neural routing. Our code is public.
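The two mechanisms can be illustrated with a short PyTorch sketch. The snippet below reflects only my reading of the abstract, not the authors' exact formulation: `CopyGateLayer` shows a copy gate that lets a Transformer column either overwrite its state with the layer's update or copy its input unchanged, and `geometric_attention_weights` shows a geometric-style normalization in which each key is accepted with probability sigmoid(score), conditioned on all closer keys having been rejected. The class and function names, the separate data/gate feed-forward branches, the distance ordering, and the final renormalization are all assumptions made for illustration; see the paper and the released code for the actual architecture.

```python
import torch
import torch.nn as nn


class CopyGateLayer(nn.Module):
    """Illustrative Transformer encoder layer with a copy gate: when the gate
    is closed, the column copies its input unchanged to the next layer."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ff = nn.LayerNorm(d_model)
        # "Data" branch: computes the candidate update for this column.
        self.ff_data = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        # "Gate" branch: decides how much of the update to let through.
        self.ff_gate = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention with a residual connection, followed by layer norm.
        a, _ = self.attn(x, x, x, need_weights=False)
        a = self.norm_attn(x + a)
        u = self.norm_ff(self.ff_data(a))    # candidate update
        g = torch.sigmoid(self.ff_gate(a))   # copy gate in (0, 1)
        # Gate ~ 0: keep the old column state; gate ~ 1: overwrite with the update.
        return g * u + (1.0 - g) * x


def geometric_attention_weights(scores: torch.Tensor) -> torch.Tensor:
    """Geometric-style normalization of raw attention scores (illustrative).

    scores: (batch, query_len, key_len), assumed already ordered along the last
    dimension from the closest to the farthest key for every query position.
    Each key is accepted with probability sigmoid(score) given that all closer
    keys were rejected; the final renormalization is an assumption here.
    """
    p = torch.sigmoid(scores)
    # Probability that every closer key was rejected (exclusive cumulative product).
    reject = torch.cumprod(1.0 - p, dim=-1)
    reject = torch.cat([torch.ones_like(reject[..., :1]), reject[..., :-1]], dim=-1)
    weights = p * reject
    return weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-6)
```

The intuition behind this kind of gating is that a closed gate (g ≈ 0) turns a column into a no-op, so information can wait in place until it is needed, while the geometric scheme biases attention toward the closest position that is confidently accepted rather than spreading weight over the whole sequence.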
