Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search

We introduce Picturebook, a large-scale lookup operation that grounds language via ‘snapshots’ of our physical world accessed through image search. For each word in a vocabulary, we retrieve the top-k images from Google image search and feed them through a convolutional network to extract a word embedding. We introduce a multimodal gating function to fuse our Picturebook embeddings with other word representations. We also introduce Inverse Picturebook, a mechanism that maps a Picturebook embedding back into words. We report results across a wide range of tasks: word similarity, natural language inference, semantic relatedness, sentiment/topic classification, image-sentence ranking and machine translation. We also show that gate activations corresponding to Picturebook embeddings are highly correlated with human concreteness ratings.
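To make the three mechanisms above concrete, the following is a minimal Python/NumPy sketch, not the authors' implementation. The names `image_search`, `convnet_features`, `W_g`, and `b_g` are hypothetical stand-ins (for Google image search, a pretrained convolutional network, and learned gate parameters), the mean-pooling over the k image features is an illustrative choice, and the sigmoid convex-combination gate is one common parameterization of a multimodal gating unit rather than necessarily the paper's exact form.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def picturebook_embedding(word, image_search, convnet_features, k=10):
    """Ground `word` visually: fetch its top-k search images and pool
    their convnet features into a single vector (pooling is an assumption)."""
    images = image_search(word, k)                        # k 'snapshots'
    feats = np.stack([convnet_features(img) for img in images])
    return feats.mean(axis=0)

def gated_fusion(e_text, e_img, W_g, b_g):
    """Multimodal gate: a learned, per-dimension convex combination of a
    textual embedding (e.g. GloVe) and a Picturebook embedding.
    Shapes: e_text, e_img -> (d,), W_g -> (d, 2d), b_g -> (d,)."""
    g = sigmoid(W_g @ np.concatenate([e_text, e_img]) + b_g)
    return g * e_text + (1.0 - g) * e_img

def inverse_picturebook(query, vocab_words, vocab_embs):
    """Map an embedding back into words by ranking the vocabulary
    by cosine similarity (one plausible realization of the inverse map)."""
    sims = vocab_embs @ query
    sims /= np.linalg.norm(vocab_embs, axis=1) * np.linalg.norm(query) + 1e-8
    return [vocab_words[i] for i in np.argsort(-sims)]

# Toy usage with random embeddings, just to show the shapes line up.
d = 4
rng = np.random.default_rng(0)
e_text, e_img = rng.normal(size=d), rng.normal(size=d)
W_g, b_g = rng.normal(size=(d, 2 * d)), np.zeros(d)
fused = gated_fusion(e_text, e_img, W_g, b_g)
```

Because the gate is computed per dimension, the model can lean on the visual view for concrete words and the textual view for abstract ones, which is consistent with the reported correlation between gate activations and concreteness ratings.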
