论文信息 - On the Applicability of Neural Network and Machine Learning Methodologies to Natural Language Processing

On the Applicability of Neural Network and Machine Learning Methodologies to Natural Language Processing

How can we apply neural network and machine learning methodologies to natural language processing? In this paper we consider the task of training a neural network to classify natural language sentences as grammatical or ungrammatical thereby exhibiting the same kind of discriminatory power provided by the Principles and Parameters linguistic framework, or Government-and-Binding theory. We have investigated the following models: feed-forward neural networks, Frasconi-Gori-Soda and Back-Tsoi locally recurrent neural networks, Williams and Zipser and Elman recurrent neural networks, Euclidean and editdistance nearest-neighbors, simulated annealing, and decision trees. Non-neural network machine learning methods are included primarily for comparison. Initial simulations were only partially successful by using a large temporal window as input to the models. Investigation indicated that success obtained this way did not imply that the models had learnt the grammar to a significant degree. Attempts to train networks with small temporal windows failed until we implemented several techniques aimed at avoiding local minima. We discuss the strengths and weaknesses of learning as compared to manual encoding, and we consider the similarities and differences between the various neural network and machine learning approaches. Also with Electrical and Computer Engineering, University of Queensland, St. Lucia Qld 4072, Australia. y Also with the Institute for Advanced Computer Studies, Univ ersity of Maryland, College Park, MD 20742. 1 Motivation 1.1 Language and Its Acquisition Certainly one of the most important questions for the study of human language is: How do people unfailingly manage to acquire such a complex rule system? A system so complex that it has resisted the efforts of linguists to date to adequately describe in a formal system (Chomsky 1986)? Here, we will provide a couple of examples of the kind of knowledge native speakers often take for granted. For instance, any native speaker of English knows that the adjectiveeagerobligatorily takes a complementizer for with a sentential complement that contains an overt subject, but that the verbelievecannot. Moreover, eager may take a sentential complement with a non-overt, i.e. an implied or understood, subject, but believecannot: 1 *I am eager John to be here I believe John to be here I am eager for John to be here *I believe for John to be here I am eager to be here *I believe to be here Such grammaticality judgments are sometimes subtle but unarguably form part of the native speaker’s language competence. In other cases, judgment falls not on acceptability but on other aspects of language competence such as interpretation. Consider the reference of the embedded subject of the predicate to talk to in the following examples: John is too stubborn for Mary to talk to John is too stubborn to talk to John is too stubborn to talk to Bill 1As is conventional, we use the asterisk to indicate ungramma ticality in these examples. In the first sentence, it is clear that Mary is the subject of the embedded predicate. As every native speaker knows, there is a strong contrast in the co-reference options for the understood subject in the second and third sentences despite their surface similarity. In the third sentence, John must be the implied subject of the predicate to talk to. By contrast,Johnis understood as the object of the predicate in the second sentence, the subject here having arbitrary reference; in other words, the sentence can be read as John is too stubborn for some arbitrary person to talk to John . The point we would like to emphasize here is that the language faculty has impressive discriminatory power, in the sense that a single word, as seen in the examples above, can result in sharp differences in acceptability or alter th interpretation of a sentence considerably. Furthermore, the judgments shown above are robust in the sense that virtually all native speakers will agree with the data. In the light of such examples and the fact that such contrasts crop up not just in English but in other languages (for example, thestubborncontrast also holds in Dutch), some linguists (chiefly Chomsky (Chomsky 1981)) have hypothesized that it is only reasonable that such knowledge is only partially acquired: the lack of variation found across speakers, and indeed, languages for certain classes of data suggests that there exists a fixed component of the language system. In other words, there is an innate component of the language faculty of the human mind that governs language processing. All languages obey these so-called universal principles. Since languages do differ with regard to things like subjectobject-verb order, these principles are subject to parameters encoding systematic variations found in particular languages. Under the innateness hypothesis, only the language parameters plus the language-specific lexicon are acquired by the speaker; in particular, the principles are not learned. Based on these assumptions, the study of these language-independent principles has become known as the Principles-and-Parameters framework, or Government-and-Binding (GB) theory. In this paper, we ask the question: Can a neural network be made to exhibit the same kind of discriminatory power on the data GB-linguists have examined? More precisely, the goal of the experiment is to train a neural net from scratch, i.e. without the bifurcation into learned vs. innate components assumed by Chomsky, to produce the same judgments as native speakers on the sharply grammatical/ungrammatical pairs of the sort discussed above. 1.2 Representational Power The most successful stochastic language models have been based on finite-state descriptions such as n-grams or hidden Markov models. However, finite-state models cannot represent hierarchical structures as found in natural language2 (Pereira 1992). In the past few years several recurrent neural network architectures have emerged which have been used for grammatical inference (Cleeremans, Servan-Schreiber & McClelland 1989, Giles, Sun, Chen, Lee & Chen 1990, Giles, Chen, Miller, Chen, Sun & Lee 1991, Giles, Miller, Chen, Chen, Sun & Lee 1992, Giles, Miller, Chen, Sun, Chen & Lee 1992). Do neural networks posses the power required for the task at hand? Yes, it has been shown that recurrent networks have the representational power required for hierarchical solutions (Elman 1991), and that they are Turing equivalent (Siegelmann & Sontag 1992). However, only recently has any work been successful with moderately large grammars. Recurrent neural networks have been used for several small natural language problems, e.g. papers using the Elman network for natural language tasks include: (Stolcke 1990, Allen 1983, Elman 1984, Harris & Elman 1984, John & McLelland 1990).

Sandiway Fong | C. Lee Giles | Steve Lawrence | S. Lawrence | Sandiway Fong