Glyce: Glyph-vectors for Chinese Character Representations

It is intuitive that NLP tasks for logographic languages like Chinese should benefit from the glyph information in those languages. However, because glyphs lack rich pictographic evidence and standard computer vision models generalize poorly on character data, an effective way to use glyph information remains to be found. In this paper, we address this gap by presenting Glyce, the glyph-vectors for Chinese character representations. We make three major innovations: (1) we use historical Chinese scripts (e.g., bronzeware script, seal script, and traditional Chinese) to enrich the pictographic evidence in characters; (2) we design CNN structures (called tianzege-CNN) tailored to Chinese character image processing; and (3) we use image classification as an auxiliary task in a multi-task learning setup to increase the model's ability to generalize. We show that glyph-based models consistently outperform word/char ID-based models in a wide range of Chinese NLP tasks. We set new state-of-the-art results for a variety of Chinese NLP tasks, including tagging (NER, CWS, POS), sentence pair classification, single sentence classification, dependency parsing, and semantic role labeling. For example, the proposed model achieves an F1 score of 80.6 on the OntoNotes NER dataset, +1.5 over BERT, and an almost perfect accuracy of 99.8% on the Fudan corpus for text classification. The code is publicly available.
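The multi-task setup in innovation (3) can be illustrated with a minimal sketch of how an image-classification auxiliary loss might be combined with a downstream task loss. The abstract does not give the formula, so the interpolation form, the decaying weight schedule, the default constants, and all function names below are assumptions made for illustration only.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def aux_weight(step, lam0=0.1, decay=0.8):
    """Hypothetical schedule: the auxiliary weight shrinks as training proceeds."""
    return lam0 * decay ** step


def multitask_loss(task_logits, task_label, glyph_logits, char_id, step):
    """Interpolate the task loss with the glyph image-classification loss.

    glyph_logits: scores over character IDs predicted from the glyph image
    char_id:      index of the character the image actually depicts
    """
    lam = aux_weight(step)
    task = cross_entropy(task_logits, task_label)
    aux = cross_entropy(glyph_logits, char_id)
    return (1.0 - lam) * task + lam * aux


# Usage: a 2-class task and a toy 4-character vocabulary, at training step 0.
loss = multitask_loss([0.0, 0.0], 0, [0.0, 0.0, 0.0, 0.0], 2, step=0)
```

Under this assumed schedule the auxiliary objective acts mainly as an early regularizer: its weight is largest at the start of training and decays toward zero, leaving the task loss dominant later on.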
