The entropy of English using PPM-based models

This paper shows that the gap between the best machine models and human models of English is smaller than previous results suggest. This follows from three observations: first, the original human experiments used a 27-character alphabet (letters plus space), whereas most computer experiments used full 128-character ASCII text; second, large amounts of priming text substantially improve the performance of PPM (prediction by partial matching); and third, the PPM algorithm can be modified to perform better on English text. Together these bring machine performance down to 1.46 bits per character (bpc). The paper discusses the problem of estimating the entropy of English, demonstrates the importance of training text for PPM, and shows that PPM's performance can be improved by "adjusting" the alphabet used. Results based on these improvements are then given.
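To make the alphabet point concrete, the following is a minimal sketch and not the paper's method: it folds text to a 27-character alphabet and measures bits per character with a simple order-0 (character frequency) model, which is far weaker than PPM. The names `ALPHABET`, `to_27_char`, and `order0_bits_per_char` are hypothetical, and the folding rule (non-letters become a space, runs of spaces collapse) is an assumption rather than the procedure used in the original human experiments.

```python
import math
from collections import Counter

# Assumed 27-character alphabet: lowercase letters plus space.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def to_27_char(text):
    """Fold text to letters plus space.

    Assumption: any character outside the alphabet is treated as a
    space, and runs of spaces are collapsed to a single space.
    """
    folded = "".join(c if c in ALPHABET else " " for c in text.lower())
    return " ".join(folded.split())


def order0_bits_per_char(text):
    """Empirical order-0 entropy of text, in bits per character.

    This uses character frequencies only; it is an illustration of
    the bpc unit, not a PPM model.
    """
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    sample = "The entropy of English, measured in bits per character!"
    folded = to_27_char(sample)
    # A uniform code needs log2(27) bits per character for the folded
    # alphabet, against log2(128) = 7 bits for full ASCII.
    print("uniform bound, 27 chars:", math.log2(len(ALPHABET)))
    print("uniform bound, ASCII:   ", math.log2(128))
    print("order-0 bpc, original:  ", order0_bits_per_char(sample))
    print("order-0 bpc, folded:    ", order0_bits_per_char(folded))
```

Figures measured on different alphabets are not directly comparable, which is why the choice of alphabet matters when setting machine results beside human ones.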
