Text Categorization in R: A Reduced N-Gram Approach

For most natural language processing (NLP) methods, identifying the language of the text being processed is one of the key tasks. NLP techniques often depend on language-specific resources, e.g., the correct stop word list or the correct set of stemming rules. Among the various approaches to language identification or, more generally, text categorization, a rather large proportion is based on the character N-gram approach pioneered by Cavnar and Trenkle. In this contribution we show how to produce language and document profiles using a reduced version of Cavnar and Trenkle’s original algorithm. In addition, we compare the performance of N-gram based text classification under the original and the reduced approach. For this purpose, two groups of language profiles were used: one composed of heterogeneous text data, the other based solely on articles from Wikipedia. Within this context we present the R package textcat. It enables the user to generate language profile databases as well as document profiles, and to perform text classification according to both the original and the reduced N-gram approach.
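
To make the idea of language and document profiles concrete, the following base R sketch illustrates the general scheme of Cavnar and Trenkle's approach: rank the character N-grams of a text by frequency, then compare a document profile with each language profile using an out-of-place rank distance. It is a minimal illustration under our own assumptions, not the textcat implementation. The function names `ngram_profile`, `out_of_place` and `classify`, the toy training sentences, and the defaults `max_n = 3` and `size = 400` are hypothetical, and the reduced variant is not shown.

```r
# Illustrative sketch only: these helpers are hypothetical and are NOT
# part of the textcat package.

# Rank the character n-grams (n = 1..max_n) of a text by decreasing frequency.
# Each word is padded with "_" to mark word boundaries.
ngram_profile <- function(text, max_n = 3, size = 400) {
  words <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  words <- words[nzchar(words)]
  grams <- unlist(lapply(words, function(w) {
    padded <- paste0("_", w, "_")
    len <- nchar(padded)
    unlist(lapply(seq_len(max_n), function(n) {
      if (len < n) return(character(0))
      starts <- seq_len(len - n + 1)
      substring(padded, starts, starts + n - 1)
    }))
  }))
  grams <- grams[grams != "_"]  # drop bare boundary markers
  counts <- sort(table(grams), decreasing = TRUE)
  head(names(counts), size)     # keep the `size` most frequent n-grams
}

# Out-of-place distance: sum of rank differences between two profiles.
# N-grams missing from the language profile receive a maximum penalty.
out_of_place <- function(doc, lang) {
  pos <- match(doc, lang)
  penalty <- length(lang)
  as.numeric(sum(ifelse(is.na(pos), penalty, abs(pos - seq_along(doc)))))
}

# Assign the language whose profile is closest to the document profile.
classify <- function(text, profiles) {
  doc <- ngram_profile(text)
  dists <- vapply(profiles, function(p) out_of_place(doc, p), numeric(1))
  names(dists)[which.min(dists)]
}

# Toy training data (assumed for illustration; real profiles need far more text).
samples <- c(
  english = "the quick brown fox jumps over the lazy dog and then the cat sat on the mat",
  german  = "der schnelle braune fuchs springt und dann sitzt die katze auf der matte"
)
profiles <- lapply(samples, ngram_profile)

classify("the cat and the dog", profiles)
```

With training texts this small the result for new input is not reliable; the sketch is only meant to show how profiles and the rank-based distance fit together.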