Mix-review: Alleviate Forgetting in the Pretrain-Finetune Framework for Neural Language Generation Models

In this work, we study how the large-scale pretrain-finetune framework changes the behavior of a neural language generator. We focus on the transformer encoder-decoder model for the open-domain dialogue response generation task. We find that after standard fine-tuning, the model forgets important language generation skills acquired during large-scale pre-training. We demonstrate this forgetting phenomenon through a detailed behavior analysis from the perspectives of context sensitivity and knowledge transfer. Adopting the concept of data mixing, we propose an intuitive fine-tuning strategy named "mix-review". We find that mix-review effectively regularizes the fine-tuning process, largely alleviating the forgetting problem. Finally, we discuss interesting behaviors of the resulting dialogue model and their implications.
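
To make the data-mixing idea concrete, the sketch below shows one way a "mix-review" style epoch could be assembled: each fine-tuning epoch reuses ("reviews") a shrinking random slice of the pretraining corpus alongside the target-domain dialogue data. This is a minimal illustrative sketch, not the paper's exact recipe; the function name mix_review_batches, the exponentially decaying schedule, and the hyperparameters mix_ratio and decay are assumptions chosen for clarity.

    import random

    def mix_review_batches(finetune_data, pretrain_data, epoch,
                           mix_ratio=4.0, decay=0.7, batch_size=32, seed=0):
        # Number of pretraining examples to "review" this epoch; the
        # exponentially decaying schedule is an illustrative assumption.
        rng = random.Random(seed + epoch)
        n_review = int(len(finetune_data) * mix_ratio * decay ** epoch)
        review = rng.sample(pretrain_data, min(n_review, len(pretrain_data)))
        # Shuffle the target-domain data together with the reviewed data,
        # so every mini-batch mixes both distributions.
        mixed = list(finetune_data) + review
        rng.shuffle(mixed)
        return [mixed[i:i + batch_size]
                for i in range(0, len(mixed), batch_size)]

    # Example: progressively less pretraining data is reviewed each epoch.
    finetune_data = [f"dialogue-{i}" for i in range(100)]
    pretrain_data = [f"web-text-{i}" for i in range(10000)]
    for epoch in range(3):
        batches = mix_review_batches(finetune_data, pretrain_data, epoch)
        print(epoch, len(batches), "mini-batches")

Because the reviewed pretraining examples appear in the same mini-batches as the fine-tuning data, the gradient updates are pulled back toward the pretrained behavior, which is the regularization effect the abstract refers to.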
