How Much Do Language Models Copy From Their Training Data? Evaluating Linguistic Novelty in Text Generation Using RAVEN

Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure—e.g., individual dependencies—model-generated text is substantially less novel than our baseline of human-generated text from each model’s test set. For larger-scale structure—e.g., overall sentence structure—model-generated text is as novel or even more novel than the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages over 1,000 words long from the training set. We also perform extensive manual analysis showing that GPT-2’s novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues (e.g., being self-contradictory).
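
To illustrate the n-gram side of this kind of novelty analysis, the sketch below computes the fraction of generated n-grams that never appear in the training data. It is a minimal illustration under simplifying assumptions (whitespace tokenization, a single training corpus held in memory), not the RAVEN implementation itself; the function names and toy token lists are hypothetical.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty_fraction(generated_tokens, training_tokens, n):
    """Fraction of generated n-grams that do not occur anywhere in the training tokens."""
    seen = set(ngrams(training_tokens, n))
    gen = ngrams(generated_tokens, n)
    if not gen:
        return 0.0
    novel = sum(1 for g in gen if g not in seen)
    return novel / len(gen)

# Toy example: 2 of the 5 generated bigrams are unseen in training, so novelty is 0.4.
train = "the cat sat on the mat".split()
gen = "the cat sat on a rug".split()
print(novelty_fraction(gen, train, n=2))
```

Running the same measurement for several values of n separates local reuse (small n, where high overlap with training data is expected even for humans) from larger-scale copying (large n, where any exact match indicates duplication).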
