Introduction to the Special Issue on Computational Linguistics Using Large Corpora

The 1990s have witnessed a resurgence of interest in 1950s-style empirical and statistical methods of language analysis. Empiricism was at its peak in the 1950s, dominating a broad set of fields ranging from psychology (behaviorism) to electrical engineering (information theory). At that time, it was common practice in linguistics to classify words not only on the basis of their meanings but also on the basis of their cooccurrence with other words. Firth, a leading figure in British linguistics during the 1950s, summarized the approach with the memorable line: "You shall know a word by the company it keeps" (Firth 1957). Regrettably, interest in empiricism faded in the late 1950s and early 1960s with a number of significant events including Chomsky's criticism of n-grams in Syntactic Structures (Chomsky 1957) and Minsky and Papert's criticism of neural networks in Perceptrons (Minsky and Papert 1969).

Perhaps the most immediate reason for this empirical renaissance is the availability of massive quantities of data: more text is available than ever before. Just ten years ago, the one-million-word Brown Corpus (Francis and Kučera 1982) was considered large, but even then, there were much larger corpora such as the Birmingham Corpus (Sinclair et al. 1987; Sinclair 1987). Today, many locations have samples of text running into the hundreds of millions or even billions of words. Collections of this magnitude are becoming widely available, thanks to data collection efforts such as the Association for Computational Linguistics' Data Collection Initiative (ACL/DCI), the European Corpus Initiative (ECI), ICAME, the British National Corpus (BNC), the Linguistic Data Consortium (LDC), the Consortium for Lexical Research (CLR), Electronic Dictionary Research (EDR), and standardization efforts such as the Text Encoding Initiative (TEI).
The data-intensive approach to language, which is becoming known as Text Analysis, takes a pragmatic approach that is well suited to meet the recent emphasis on numerical evaluations and concrete deliverables. Text Analysis focuses on broad (though possibly superficial) coverage of unrestricted text, rather than deep analysis of (artificially) restricted domains.

[1]  John Sinclair,et al.  Looking up : an account of the COBUILD Project in lexical computing and the development of the Collins COBUILD English Language Dictionary , 1987 .

[2]  F. Mosteller,et al.  Inference and Disputed Authorship: The Federalist , 1966 .

[3]  Bernard Mérialdo,et al.  Natural Language Modeling for Phoneme-to-Text Transcription , 1986, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  N. R. Dixon,et al.  Preliminary results on the performance of a system for the automatic recognition of continuous speech , 1976, ICASSP.

[5]  I. Good THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS , 1953 .

[6]  Gerald Salton,et al.  Automatic text processing , 1988 .

[7]  Carl de Marcken,et al.  Parsing the LOB Corpus , 1990, ACL.

[8]  Xabier Arregi,et al.  Towards Noun Homonym Disambiguation Using Local Context in Large Text Corpora , .

[9]  Julian Kupiec,et al.  Augmenting a Hidden Markov Model for Phrase-Dependent Word Tagging , 1989, HLT.

[10]  W. Bruce Croft,et al.  The use of phrases and structured queries in information retrieval , 1991, SIGIR '91.

[11]  B. Boguraev Book Reviews: Looking Up: An Account of the COBUILD PROJECT IN LEXICAL COMPUTING , 1990, CL.

[12]  Patti Price,et al.  The DARPA 1000-word resource management database for continuous speech recognition , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[13]  John Cocke,et al.  A Statistical Approach to Machine Translation , 1990, CL.

[14]  Karen Kukich,et al.  Techniques for automatically correcting words in text , 1992, CSUR.

[15]  Douglas Biber,et al.  Representativeness in corpus design , 1993 .

[16]  J. Jenkins,et al.  Word association norms , 1964 .

[17]  L. R. Rasmussen,et al.  In information retrieval: data structures and algorithms , 1992 .

[18]  Geoffrey Leech,et al.  The Automatic Grammatical Tagging of the LOB Corpus , 1983 .

[19]  M. Braga,et al.  Exploratory Data Analysis , 2018, Encyclopedia of Social Network Analysis and Mining. 2nd Ed..

[20]  Mill Johannes G.A. Van,et al.  Transmission Of Information , 1961 .

[21]  B. Merialdo,et al.  Tagging text with a probabilistic model , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[22]  Richard M. Schwartz,et al.  Towards Understanding Text with a Very Large Vocabulary , 1990, HLT.

[23]  Gerard Salton,et al.  A Simple Syntactic Approach for the Generation of Indexing Phrases , 1990 .

[24]  Frederick Mosteller,et al.  Data Analysis and Regression , 1978 .

[25]  M. Baltin,et al.  The Mental representation of grammatical relations , 1985 .

[26]  Fred J. Damerau,et al.  A technique for computer detection and correction of spelling errors , 1964, CACM.

[27]  Kenneth Ward Church A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text , 1989, ANLP.

[28]  Satoshi Sato,et al.  Toward Memory-based Translation , 1990, COLING.

[29]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[30]  Evelyne Tzoukermann,et al.  The BICORD System Combining Lexical Information from Bilingual Corpora and Machine Readable Dictionaries , 1990, COLING.

[31]  Marvin Minsky,et al.  Perceptrons: An Introduction to Computational Geometry , 1969 .

[32]  R. G. Leonard,et al.  A database for speaker-independent digit recognition , 1984, ICASSP.

[33]  J. R. Firth,et al.  A Synopsis of Linguistic Theory, 1930-1955 , 1957 .

[34]  William A. Woods,et al.  Augmented Transition Networks for Natural Language Analysis. , 1969 .

[35]  Rajeev Agarwal,et al.  Disambiguation of Prepositional Phrases in Automatically Labelled Technical Text , 1991, AAAI.

[36]  Fred Karlsson,et al.  Constraint Grammar as a Framework for Parsing Running Text , 1990, COLING.

[37]  Donald Hindle,et al.  Acquiring Disambiguation Rules from Text , 1989, ACL.

[38]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[39]  古井 貞煕,et al.  Digital speech processing, synthesis, and recognition , 1989 .

[40]  Aaron D. Wyner,et al.  Prediction and Entropy of Printed English , 1993 .

[41]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[42]  Lalit R. Bahl,et al.  Recognition of continuously read natural corpus , 1978, ICASSP.

[43]  James Pustejovsky,et al.  Lexical Semantic Techniques for Corpus Analysis , 1993, CL.

[44]  Frederick Jelinek,et al.  Interpolated estimation of Markov source parameters from sparse data , 1980 .

[45]  Patrick Hanks,et al.  Evidence and intuition in lexicography , 1990 .

[46]  Alex Waibel,et al.  Readings in speech recognition , 1990 .

[47]  R. Bakis Continuous speech recognition via centisecond acoustic states , 1976 .

[48]  Branimir Boguraev,et al.  Review of Looking up: an account of the COBUILD project in lexical computing by John M. Sinclair. Collins ELT 1987. , 1990 .

[49]  Michael R. Brent,et al.  From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax , 1993, Comput. Linguistics.

[50]  Dennis H. Klatt,et al.  Review of the ARPA speech understanding project , 1990 .

[51]  L. Baum,et al.  An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process , 1972 .

[52]  Robert L. Mercer,et al.  An Estimate of an Upper Bound for the Entropy of English , 1992, CL.

[53]  William A. Woods,et al.  Transition Network Grammars for Natural Language Analysis , 1970, CACM.

[54]  Steve Young,et al.  Applications of stochastic context-free grammars using the Inside-Outside algorithm , 1990 .

[55]  J. Baker Trainable grammars for speech recognition , 1979 .

[56]  D. Brenneis,et al.  Caught in the Web of Words , 1995 .

[57]  Amiel Feinstein,et al.  Transmission of Information. , 1962 .

[58]  Terry A. Welch,et al.  A Technique for High-Performance Data Compression , 1984, Computer.

[59]  Steven J. DeRose,et al.  Grammatical Category Disambiguation by Statistical Optimization , 1988, CL.

[60]  K. M. E. Murray Caught in the Web of Words: James Murray and the Oxford English Dictionary , 1977 .

[61]  R. Burchfield Frequency Analysis of English Usage: Lexicon and Grammar. By W. Nelson Francis and Henry Kučera with the assistance of Andrew W. Mackie. Boston: Houghton Mifflin. 1982. x + 561 , 1985 .

[62]  Julian M. Kupiec,et al.  Robust part-of-speech tagging using a hidden Markov model , 1992 .

[63]  Claude E. Shannon,et al.  Prediction and Entropy of Printed English , 1951 .

[64]  John Sinclair,et al.  Collins COBUILD English Language Dictionary , 1987 .