n-Grams and their implication to natural language understanding

Abstract

This paper presents the results of a comparative information-theoretic study of Greek and English texts. The rank-frequency correlation (ρ) between the two languages appears to be very high, ranging from 0.915 to 0.989. The results also include positional letter analyses, n-gram analyses, word analyses, empirical semantic correlations between the Greek and English n-grams, and entropy calculations. The findings are of interest to researchers in natural language understanding, text processing and compression, speech synthesis and recognition, and error detection and correction, because they cover the complete range of hierarchic text patterns: letters, n-grams (sub-word patterns) and words.
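As a rough illustration of the kinds of measurements the abstract lists, the following Python sketch counts character n-grams within words, computes the Shannon entropy of the resulting distribution, and computes a Spearman rank correlation between two paired sequences. It is not the authors' procedure: the function names (`char_ngrams`, `shannon_entropy`, `average_ranks`, `spearman_rho`), the tokenisation by whitespace, and the tie handling are all assumptions made for this example, and the abstract does not say how items from the two languages were paired before correlating them.

```python
from collections import Counter
from math import log2


def char_ngrams(text, n):
    """Count overlapping character n-grams inside each whitespace-separated word.

    Assumption: words are split on whitespace and lowercased; the study's
    actual tokenisation is not described in the abstract.
    """
    counts = Counter()
    for word in text.lower().split():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def shannon_entropy(counts):
    """Entropy, in bits, of the empirical distribution given by the counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def average_ranks(values):
    """Assign ranks 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    bigrams = char_ngrams("the then", 2)
    print(bigrams.most_common())
    print(shannon_entropy(bigrams))
    print(spearman_rho([1, 2, 3], [10, 20, 30]))
```

One plausible use, under the same assumptions, is to build frequency lists for each language with `char_ngrams`, pair the frequencies by rank position, and pass the two lists to `spearman_rho`; whether this matches the pairing used in the study cannot be determined from the abstract alone.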