R has recently gained explicit text mining support through the "tm" package, which enables statisticians to answer many interesting research questions via statistical analysis or modeling of text corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data that can be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) an increase in the amount of data to be analyzed leads to an increasing computational workload. Fortunately, adequate parallel programming models such as MapReduce, together with the corresponding open source implementation Hadoop, allow for processing data sets beyond what would fit into memory.
In this paper we present the package "tm.plugin.dc", which offers a seamless integration between "tm" and Hadoop. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size.
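To make the MapReduce programming model mentioned above concrete, the following is a minimal single-process sketch of term counting over a toy corpus. It is an illustration only: it uses neither Hadoop nor the "tm.plugin.dc" API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `term_frequencies`) and the three sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc):
    """Emit a (term, 1) pair for every whitespace-separated token of one document."""
    return [(token.lower(), 1) for token in doc.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, counts):
    """Sum the partial counts collected for one term."""
    return term, sum(counts)


def term_frequencies(corpus):
    """Run map, shuffle, and reduce in sequence over an iterable of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in corpus)
    return dict(reduce_phase(term, counts) for term, counts in shuffle(mapped).items())


# Hypothetical toy corpus; a real corpus would be split across many machines.
corpus = ["text mining in R", "mining large text corpora", "R and Hadoop"]
freqs = term_frequencies(corpus)
print(freqs)
```

Because each document is mapped independently and each term is reduced independently, both phases could in principle be distributed across workers, which is the property that lets a framework such as Hadoop process corpora larger than the memory of any single machine.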