Text Mining Infrastructure in R

During the last decade, text mining has become a widely used discipline that draws on statistical and machine learning methods. We present the tm package, which provides a framework for text mining applications within R. We survey the text mining facilities available in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis, text clustering, text classification, and string kernels.
