Deep Learning of Grammatically-Interpretable Representations Through Question-Answering

We introduce an architecture in which internal representations, learned by end-to-end optimization in a deep neural network performing a textual question-answering task, can be interpreted using basic concepts from linguistic theory. This interpretability comes at a cost of only a few percentage points of accuracy relative to the original model on which the new one is based (BiDAF [4]). The internal representation that is interpreted is a Tensor Product Representation: for each input word, the model selects a symbol to encode the word and a role in which to place the symbol, and binds the two together. The selection is via soft attention. The overall interpretation is built from interpretations of the symbols, as recruited by the trained model, and interpretations of the roles as used by the model. We find support for our initial hypothesis that symbols can be interpreted as lexical-semantic word meanings, while roles can be interpreted as approximations of grammatical roles (or categories) such as subject, wh-word, determiner, etc. Through detailed, fine-grained analysis, we find specific correspondences between the learned roles and parts of speech as assigned by a standard part-of-speech tagger [20], and find several discrepancies in the model’s favor. In this sense, the model learns significant aspects of grammar after having been exposed solely to linguistically unannotated text, questions, and answers: no prior linguistic knowledge is given to the model. What is given is the means to represent using symbols and roles, together with an inductive bias favoring use of these in an approximately discrete manner.

∗ This work was carried out while PS was on leave from Johns Hopkins University. LD is currently at Citadel.
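To make the binding mechanism concrete, the following NumPy sketch illustrates Tensor Product Representation (TPR) encoding with soft attention over a dictionary of symbols and a dictionary of roles. It is an illustrative toy under stated assumptions, not the paper's architecture: the dictionary sizes, dimensions, projection matrices, and softmax temperature are invented for the example, whereas the real model learns these parameters end-to-end inside a BiDAF-style question-answering network.

```python
import numpy as np

def softmax(x, temperature=1.0):
    # Low temperature pushes the attention toward one-hot, which is one way
    # to approximate the "approximately discrete" selection described above.
    z = (x - x.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def tpr_encode(word_vec, symbols, roles, W_s, W_r, temperature=0.1):
    """Bind a soft-selected symbol (filler) to a soft-selected role for one word.

    word_vec : (d,)       contextual encoding of the word (e.g. from a BiLSTM)
    symbols  : (n_s, d_s) dictionary of symbol vectors (learned in the real model)
    roles    : (n_r, d_r) dictionary of role vectors (learned in the real model)
    W_s, W_r : projections from the word encoding to symbol / role score space
    Returns a (d_s, d_r) binding matrix: the outer product symbol ⊗ role.
    """
    a_s = softmax(symbols @ (W_s @ word_vec), temperature)  # attention over symbols
    a_r = softmax(roles @ (W_r @ word_vec), temperature)    # attention over roles
    symbol = a_s @ symbols   # (d_s,) attention-weighted symbol
    role = a_r @ roles       # (d_r,) attention-weighted role
    return np.outer(symbol, role)

# Toy usage: a sentence representation is the sum of per-word bindings.
rng = np.random.default_rng(0)
d, d_s, d_r, n_s, n_r = 32, 16, 16, 50, 10      # illustrative sizes only
symbols = rng.standard_normal((n_s, d_s))
roles = rng.standard_normal((n_r, d_r))
W_s = rng.standard_normal((d_s, d)) * 0.1
W_r = rng.standard_normal((d_r, d)) * 0.1
words = rng.standard_normal((7, d))             # stand-in encodings of a 7-word input
sentence_tpr = sum(tpr_encode(w, symbols, roles, W_s, W_r) for w in words)
print(sentence_tpr.shape)                       # (16, 16)
```

Lowering the temperature sharpens both attention distributions toward one-hot selections, which is one way to realize the inductive bias toward approximately discrete use of symbols and roles mentioned above.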

[1] Quoc V. Le, et al. Neural Programmer: Inducing Latent Programs with Gradient Descent, 2015, ICLR.

[2] Yoshua Bengio, et al. Attention-Based Models for Speech Recognition, 2015, NIPS.

[3] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.

[4] Ali Farhadi, et al. Bidirectional Attention Flow for Machine Comprehension, 2016, ICLR.

[5] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[6] Jason Weston, et al. Memory Networks, 2014, ICLR.

[7] Martín Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2016, arXiv.

[8] Rabab Kreidieh Ward, et al. Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval, 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[9] Jason Weston, et al. End-To-End Memory Networks, 2015, NIPS.

[10] Matthew Goldrick, et al. Optimization and Quantization in Gradient Symbol Systems: A Framework for Integrating the Continuous and the Discrete in Cognition, 2014, Cognitive Science.

[11] Jason Weston, et al. Tracking the World State with Recurrent Entity Networks, 2016, ICLR.

[12] Graham W. Taylor, et al. Deconvolutional Networks, 2010, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] P. Smolensky. Representation in Connectionist Networks, 1990.

[14] Sergio Gomez Colmenarejo, et al. Hybrid computing using a neural network with dynamic external memory, 2016, Nature.

[15] P. Smolensky. Symbolic functions from neural computation, 2012, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[16] P. Smolensky, et al. The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar, 2006, MIT Press.

[17] Ming Zhou, et al. Gated Self-Matching Networks for Reading Comprehension and Question Answering, 2017, ACL.

[18] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[19] Rob Fergus, et al. Visualizing and Understanding Convolutional Networks, 2013, ECCV.

[20] Dan Klein, et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network, 2003, NAACL.

[21] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, arXiv.

[22] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[23] Alex Graves, et al. Neural Turing Machines, 2014, arXiv.