Comparing Simple Recurrent Networks and n-Grams in a Large Corpus

The increased availability of text corpora and the growth of connectionism have stimulated a renewed interest in probabilistic models of language processing in computational linguistics and psycholinguistics. The Simple Recurrent Network (SRN) is an important connectionist model because it has the potential to learn temporal dependencies of unspecified length. In addition, many computational questions about the SRN's ability to learn dependencies between individual items extend to other models. This paper reports on experiments with an SRN trained on a large corpus and examines the network's ability to learn bigrams, trigrams, and higher-order n-grams as a function of corpus size. Performance is evaluated by an information-theoretic measure of prediction (or guess) ranking and by output vector entropy. With enough training and hidden units, the SRN shows the ability to learn 5- and 6-gram dependencies, although whether a given n-gram is learned depends on its frequency and on the relative frequency of other n-grams. In some cases, the network learns relatively low-frequency deep dependencies before relatively high-frequency short ones, provided the deep dependencies do not require representational shifts in hidden unit space.
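The abstract names the two evaluation quantities, guess ranking and output vector entropy, without defining them. The sketch below shows one plausible reading, not the paper's actual procedure. It assumes the output vector holds non-negative activations that are normalized to sum to one, that entropy is Shannon entropy in bits, and that the guess rank is the position of the correct next item when outputs are sorted from highest to lowest. The function names `normalize`, `output_entropy`, and `guess_rank` are hypothetical.

```python
import math


def normalize(output):
    """Scale non-negative activations so they sum to 1 (an assumed preprocessing step)."""
    total = sum(output)
    if total <= 0:
        raise ValueError("output vector must have a positive sum")
    return [x / total for x in output]


def output_entropy(output):
    """Shannon entropy, in bits, of the normalized output vector."""
    probs = normalize(output)
    return -sum(p * math.log2(p) for p in probs if p > 0)


def guess_rank(output, target_index):
    """1-based rank of the correct item: 1 means it was the network's top guess."""
    target = output[target_index]
    return 1 + sum(1 for x in output if x > target)


# Hypothetical output over a three-item vocabulary; the correct next item is index 2.
example_output = [0.1, 0.6, 0.3]
print(guess_rank(example_output, 2))
print(output_entropy(example_output))
```

Under this reading, a low rank together with low entropy suggests the network has committed to the right continuation, while a low rank with high entropy suggests a correct but weakly held guess.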
