Tensor Product Generation Networks

We present a new tensor product generation network (TPGN) that generates natural language descriptions for images. The model has a novel architecture that instantiates a general framework for encoding and processing symbolic structure through neural network computation. This framework is built on Tensor Product Representations (TPRs). We evaluated the proposed TPGN on the MS COCO image captioning task. The experimental results show that the TPGN outperforms the LSTM based state-of-the-art baseline with a significant margin. Further, we show that our caption generation model can be interpreted as generating sequences of grammatical categories and retrieving words by their categories from a plan encoded as a distributed representation.

[1]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Zhe Gan,et al.  Semantic Compositional Networks for Visual Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Geoffrey E. Hinton Tensor Product Variable Binding and the Representation of Symbolic Structures in Connectionist Systems , 1991 .

[4]  Pyeong Whan Cho,et al.  Incremental parsing in a continuous dynamical system: sentence processing in Gradient Symbolic Computation , 2017 .

[5]  Joe Pater The harmonic mind : from neural computation to optimality-theoretic grammar , 2009 .

[6]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[7]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[8]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[9]  Ruslan Salakhutdinov,et al.  Multimodal Neural Language Models , 2014, ICML.

[10]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[11]  Christopher D. Manning,et al.  Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks , 2015, ACL.

[12]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[13]  Trevor Darrell,et al.  Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.

[14]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Geoffrey Zweig,et al.  Language Models for Image Captioning: The Quirks and What Works , 2015, ACL.

[16]  Richard Socher,et al.  Ask Me Anything: Dynamic Memory Networks for Natural Language Processing , 2015, ICML.

[17]  Xinlei Chen,et al.  Mind's eye: A recurrent visual representation for image caption generation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  William Ramsey THE HARMONIC MIND: FROM NEURAL COMPUTATION TO OPTIMALITY‐THEORETIC GRAMMAR—VOLUME 1: COGNITIVE ARCHITECTURE AND VOLUME 2: LINGUISTIC AND PHILOSOPHICAL IMPLICATIONS , 2009 .

[19]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[20]  Jordan B. Pollack,et al.  Recursive Distributed Representations , 1990, Artif. Intell..

[21]  Dan Klein,et al.  Deep Compositional Question Answering with Neural Module Networks , 2015, ArXiv.

[22]  Stephen Clark,et al.  Jointly learning sentence embeddings and syntax with unsupervised Tree-LSTMs , 2017, Natural Language Engineering.

[23]  P. Smolensky Symbolic functions from neural computation , 2012, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[24]  Xiaokui Shu,et al.  Natural Language Toolkit (NLTK) , 2010 .

[25]  Li Deng,et al.  Question-Answering with Grammatically-Interpretable Representations , 2017, AAAI.

[26]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[28]  Wang Ling,et al.  Learning to Compose Words into Sentences with Reinforcement Learning , 2016, ICLR.

[29]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  David Weiss,et al.  DRAGNN: A Transition-based Framework for Dynamically Connected Neural Networks , 2017, ArXiv.