Paper Information - Introduction to the Special Issue on Computational Linguistics Using Large Corpora

The 1990s have witnessed a resurgence of interest in 1950s-style empirical and statistical methods of language analysis. Empiricism was at its peak in the 1950s, dominating a broad set of fields ranging from psychology (behaviorism) to electrical engineering (information theory). At that time, it was common practice in linguistics to classify words not only on the basis of their meanings but also on the basis of their cooccurrence with other words. Firth, a leading figure in British linguistics during the 1950s, summarized the approach with the memorable line: "You shall know a word by the company it keeps" (Firth 1957). Regrettably, interest in empiricism faded in the late 1950s and early 1960s with a number of significant events including Chomsky's criticism of n-grams in Syntactic Structures (Chomsky 1957) and Minsky and Papert's criticism of neural networks in Perceptrons (Minsky and Papert 1969).

Perhaps the most immediate reason for this empirical renaissance is the availability of massive quantities of data: more text is available than ever before. Just ten years ago, the one-million word Brown Corpus (Francis and Kučera 1982) was considered large, but even then, there were much larger corpora such as the Birmingham Corpus (Sinclair et al. 1987; Sinclair 1987). Today, many locations have samples of text running into the hundreds of millions or even billions of words. Collections of this magnitude are becoming widely available, thanks to data collection efforts such as the Association for Computational Linguistics' Data Collection Initiative (ACL/DCI), the European Corpus Initiative (ECI), ICAME, the British National Corpus (BNC), the Linguistic Data Consortium (LDC), the Consortium for Lexical Research (CLR), Electronic Dictionary Research (EDR), and standardization efforts such as the Text Encoding Initiative (TEI).

The data-intensive approach to language, which is becoming known as Text Analysis, takes a pragmatic approach that is well suited to meet the recent emphasis on numerical evaluations and concrete deliverables. Text Analysis focuses on broad (though possibly superficial) coverage of unrestricted text, rather than deep analysis of (artificially) restricted domains.

Kenneth Ward Church | Robert L. Mercer

[1] John Sinclair, et al. Looking Up: An Account of the COBUILD Project in Lexical Computing and the Development of the Collins COBUILD English Language Dictionary, 1987.

[2] F. Mosteller, et al. Inference and Disputed Authorship: The Federalist, 1966.

[3] Bernard Mérialdo, et al. Natural Language Modeling for Phoneme-to-Text Transcription, 1986, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4] N. R. Dixon, et al. Preliminary results on the performance of a system for the automatic recognition of continuous speech, 1976, ICASSP.

[5] I. Good. The Population Frequencies of Species and the Estimation of Population Parameters, 1953.

[6] Gerard Salton, et al. Automatic text processing, 1988.

[7] Carl de Marcken, et al. Parsing the LOB Corpus, 1990, ACL.

[8] Xabier Arregi, et al. Towards Noun Homonym Disambiguation Using Local Context in Large Text Corpora.

[9] Julian Kupiec, et al. Augmenting a Hidden Markov Model for Phrase-Dependent Word Tagging, 1989, HLT.

[10] W. Bruce Croft, et al. The use of phrases and structured queries in information retrieval, 1991, SIGIR '91.

[11] B. Boguraev. Book Reviews: Looking Up: An Account of the COBUILD Project in Lexical Computing, 1990, CL.

[12] Patti Price, et al. The DARPA 1000-word resource management database for continuous speech recognition, 1988, ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing.

[13] John Cocke, et al. A Statistical Approach to Machine Translation, 1990, CL.

[14] Karen Kukich, et al. Techniques for automatically correcting words in text, 1992, CSUR.

[15] Douglas Biber, et al. Representativeness in corpus design, 1993.

[16] J. Jenkins, et al. Word association norms, 1964.

[17] L. R. Rasmussen, et al. In information retrieval: data structures and algorithms, 1992.

[18] Geoffrey Leech, et al. The Automatic Grammatical Tagging of the LOB Corpus, 1983.

[19] M. Braga, et al. Exploratory Data Analysis, 2018, Encyclopedia of Social Network Analysis and Mining, 2nd Ed.

[20] Mill Johannes G.A. Van, et al. Transmission of Information, 1961.

[21] B. Merialdo, et al. Tagging text with a probabilistic model, 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[22] Richard M. Schwartz, et al. Towards Understanding Text with a Very Large Vocabulary, 1990, HLT.

[23] Gerard Salton, et al. A Simple Syntactic Approach for the Generation of Indexing Phrases, 1990.

[24] Frederick Mosteller, et al. Data Analysis and Regression, 1978.

[25] M. Baltin, et al. The Mental Representation of Grammatical Relations, 1985.

[26] Fred J. Damerau, et al. A technique for computer detection and correction of spelling errors, 1964, CACM.

[27] Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text, 1989, ANLP.

[28] Satoshi Sato, et al. Toward Memory-based Translation, 1990, COLING.

[29] Robert L. Mercer, et al. The Mathematics of Statistical Machine Translation: Parameter Estimation, 1993, CL.

[30] Evelyne Tzoukermann, et al. The BICORD System Combining Lexical Information from Bilingual Corpora and Machine Readable Dictionaries, 1990, COLING.

[31] Marvin Minsky, et al. Perceptrons: An Introduction to Computational Geometry, 1969.

[32] R. G. Leonard, et al. A database for speaker-independent digit recognition, 1984, ICASSP.

[33] J. R. Firth, et al. A Synopsis of Linguistic Theory, 1930-1955, 1957.