Text Mining Infrastructure in R

During the last decade, text mining has become a widely used discipline that draws on statistical and machine learning methods. We present the tm package, which provides a framework for text mining applications within R. We survey the text mining facilities available in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis, text clustering, text classification, and string kernels.
