HUBERT Untangles BERT to Improve Transfer across NLP Tasks

We introduce HUBERT, which combines the structured-representational power of Tensor-Product Representations (TPRs) with BERT, a pre-trained bidirectional Transformer language model. We show that there is shared structure across different NLP datasets that HUBERT, but not BERT, is able to learn and leverage. We validate the effectiveness of our model on the GLUE benchmark and the HANS dataset. Our experimental results show that untangling data-specific semantics from general language structure is key to better transfer among NLP tasks.
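To make the TPR idea concrete, the following is a minimal sketch of tensor-product binding and unbinding, the mechanism HUBERT adds on top of BERT's contextual token embeddings. The dimensions, the split into "filler" (content) and "role" (structure) vectors, the orthonormal-role assumption, and the numpy formulation are illustrative assumptions, not the paper's exact architecture or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 4 tokens, 8-dim fillers, 5-dim roles.
# In HUBERT, filler and role vectors would be derived from BERT's
# contextual embeddings rather than sampled at random.
n_tokens, d_filler, d_role = 4, 8, 5

# Hypothetical per-token filler vectors (what is said) ...
fillers = rng.normal(size=(n_tokens, d_filler))

# ... and role vectors (where/how it fits in the structure),
# made orthonormal here so that exact unbinding is possible.
roles = np.linalg.qr(rng.normal(size=(d_role, n_tokens)))[0].T  # (n_tokens, d_role)

# Bind each filler to its role with an outer product and superpose:
# T = sum_i f_i (outer) r_i, giving a single (d_filler, d_role) matrix.
T = np.einsum('nf,nr->fr', fillers, roles)

# Unbind: with orthonormal roles, f_i is recovered as T @ r_i.
recovered = T @ roles[0]
print(np.allclose(recovered, fillers[0]))  # True
```

The point of the sketch is only that binding content to structure and later reading it back out are simple linear operations, which is what lets a TPR layer factor "what is said" apart from "how it is structured" on top of an entangled BERT representation.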
