A tm Plug-In for Distributed Text Mining in R

With the tm package, R has gained explicit text mining support, enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data that can be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data there is to analyze, the greater the need for efficient procedures for calculating valuable results. Fortunately, adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing data sets beyond what would fit into memory by using a distributed file system possibly spanning several machines, e.g., in a cluster of workstations. In this paper we present tm.plugin.dc, a plug-in package to tm that implements a distributed corpus class which can take advantage of the Hadoop MapReduce library for large-scale text mining tasks. Using an application in culturomics, we show that we can efficiently handle data sets of significant size.
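To make the MapReduce idea mentioned above concrete, the following is a minimal single-process sketch of term-frequency counting expressed as map, shuffle, and reduce steps. It is written in Python purely for illustration and does not reflect the tm.plugin.dc API or Hadoop itself; the function names (`map_phase`, `shuffle`, `reduce_phase`, `term_frequencies`) and the toy corpus are assumptions introduced here. In a real deployment the map and reduce steps would run on different machines over chunks of a distributed file system.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step (hypothetical helper): emit a (term, 1) pair per token."""
    for token in text.lower().split():
        yield token, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the term)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, counts):
    """Reduce step: sum the partial counts for one term."""
    return term, sum(counts)


def term_frequencies(corpus):
    """Run map, shuffle, and reduce over a dict of doc_id -> text."""
    pairs = (
        pair
        for doc_id, text in corpus.items()
        for pair in map_phase(doc_id, text)
    )
    grouped = shuffle(pairs)
    return dict(reduce_phase(term, counts) for term, counts in grouped.items())


# Toy corpus (made up for illustration only).
corpus = {
    "doc1": "text mining in R",
    "doc2": "distributed text mining",
}
freqs = term_frequencies(corpus)
```

Because each document is mapped independently and each term is reduced independently, both steps can be spread across workers without shared state, which is what lets the approach scale past the memory of one machine.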
