The complexity and entropy of literary styles

Since Shannon's original experiment in 1951, several methods have been applied to the problem of determining the entropy of English text. These methods were based either on prediction by human subjects or on computer-implemented parametric models for the data, of a certain Markov order. We ask why computer-based experiments almost always yield much higher entropy estimates than the ones produced by humans. We argue that there are two main reasons for this discrepancy. First, the long-range correlations of English text are not captured by Markovian models and, second, computer-based models only take advantage of the text statistics without being able to "understand" the contextual structure and the semantics of the given text. The second question we address is what the "entropy" of a text says about the author's literary style. In particular, is there an intuitive notion of "complexity of style" that is captured by the entropy? We present preliminary results, based on a non-parametric entropy estimation algorithm, that offer partial answers to these questions. These results indicate that taking long-range correlations into account significantly improves the entropy estimates. We obtain an estimate of 1.77 bits per character for a one-million-character sample taken from Jane Austen's works. Comparing the estimates obtained from several different texts also provides some insight into the interpretation of the notion of "entropy" when applied to English text rather than to random processes, and into the relationship between the entropy and the "literary complexity" of an author's style. Advantages of this entropy estimation method are that it does not require prior training, it is uniformly good over different styles and languages, and it seems to converge reasonably fast.

This paper was submitted as a term project for the "Special Topics in Information Theory" class EE478 taught by Prof. Tom Cover (EE Dept., Stanford Univ.) during Spring 1996. This work was supported in part by grants NSF #NCR-9205663, JSEP #DAAH04-94-G-0058, and ARPA #J-FBI-94-218-2. I. Kontoyiannis is with the Information Systems Laboratory, Durand Bldg 141A, Stanford University, Stanford, CA 94305. Email: yiannis@isl.stanford.edu
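For concreteness, the following is a minimal Python sketch of a sliding-window, match-length entropy estimator in the Lempel-Ziv / Wyner-Ziv spirit of the non-parametric string-matching approach described above. It is an illustration under stated assumptions only: the window size, the text preprocessing, and the exact estimator used in the paper may differ.

import math

def match_length(text, i, window):
    # Length of the shortest substring starting at position i that does NOT
    # occur in the preceding `window` characters (longest match length + 1).
    history = text[max(0, i - window):i]
    l = 1
    while i + l <= len(text) and text[i:i + l] in history:
        l += 1
    return l

def entropy_estimate(text, window=4096):
    # Sliding-window match-length estimator: match lengths grow roughly like
    # log2(window) / H, so H is estimated as log2(window) divided by the
    # average match length, in bits per character.
    positions = range(window, len(text))
    total = sum(match_length(text, i, window) for i in positions)
    return math.log2(window) * len(positions) / total

# Hypothetical usage (the file name is an assumption, not from the paper):
# with open("austen_sample.txt", encoding="utf-8") as f:
#     sample = f.read().lower()
# print(round(entropy_estimate(sample), 2), "bits per character")

The estimate is reported in bits per character, the same units as the 1.77 figure quoted above; a larger window lets the estimator exploit longer-range correlations, at the cost of slower string matching.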

[1] Claude E. Shannon, et al., Prediction and Entropy of Printed English, 1951.

[2] E. B. Newman, Men and information: a psychologist's view, 1959.

[3] Edwin B. Newman, et al., The Redundancy of Texts in Three Languages, 1960, Inf. Control.

[4] William Paisley, et al., The effects of authorship, topic, structure, and time of composition on letter redundancy in English texts, 1966.

[5] Dean Jamison, et al., A Note on the Entropy of Partially-Known Languages, 1968, Inf. Control.

[6] A. Kolmogorov, Three approaches to the quantitative definition of information, 1968.

[7] N. S. Barnett, et al., Private communication, 1969.

[8] K. Weltner, The Measurement of Verbal Information in Psychology and Education, 1973.

[9] Abraham Lempel, et al., A universal algorithm for sequential data compression, 1977, IEEE Trans. Inf. Theory.

[10] Thomas M. Cover, et al., A convergent gambling estimate of the entropy of English, 1978, IEEE Trans. Inf. Theory.

[11] Abraham Lempel, et al., Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[12] P. Gács, et al., Kolmogorov's Contributions to Information Theory and Algorithmic Complexity, 1989.

[13] Aaron D. Wyner, et al., Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression, 1989, IEEE Trans. Inf. Theory.

[14] Alistair Moffat, et al., Implementing the PPM data compression scheme, 1990, IEEE Trans. Commun.

[15] Thomas M. Cover, et al., Elements of Information Theory, 2005.

[16] Robert L. Mercer, et al., An Estimate of an Upper Bound for the Entropy of English, 1992, CL.

[17] Benjamin Weiss, et al., Entropy and data compression schemes, 1993, IEEE Trans. Inf. Theory.

[18] John H. Reif, et al., Using difficulty of prediction to decrease computation: fast sort, priority queue and convex hull on entropy bounded inputs, 1993, Proceedings of 1993 IEEE 34th Annual Foundations of Computer Science.

[19] Ioannis Kontoyiannis, et al., Prefixes and the entropy rate for long-range sources, 1994, Proceedings of 1994 IEEE International Symposium on Information Theory.

[20] Benoist, et al., On the Entropy of DNA: Algorithms and Measurements based on Memory and Rapid Convergence, 1994.

[21] John G. Cleary, et al., The entropy of English using PPM-based models, 1996, Proceedings of Data Compression Conference - DCC '96.

[22] Yuri M. Suhov, et al., Stationary entropy estimation via string matching, 1996, Proceedings of Data Compression Conference - DCC '96.

[23] J. Cleary, Self-organized Language Modeling for Speech Recognition, 1997.