topicmodels: An R Package for Fitting Topic Models

Topic models allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents, as well as between a set of specified keywords, using an additional layer of latent variables referred to as topics. The R package topicmodels provides basic infrastructure for fitting topic models based on data structures from the text mining package tm. It includes interfaces to two algorithms for fitting topic models: the variational expectation-maximization algorithm provided by David M. Blei and co-authors, and an algorithm using Gibbs sampling by Xuan-Hieu Phan and co-authors.
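
To make the Gibbs sampling approach concrete, the following is a minimal sketch of a collapsed Gibbs sampler for latent Dirichlet allocation, written in plain Python. It is a toy illustration only, not the code by Phan and co-authors that the package interfaces to, and not the topicmodels API. The function name `lda_gibbs`, its parameters, the default hyperparameters `alpha` and `beta`, and the tiny corpus are all assumptions made for this example.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (hypothetical helper).

    docs is a list of documents, each a list of integer term ids in
    range(vocab_size). Returns (theta, phi): per-document topic
    proportions and per-topic term distributions.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]              # counts n_dk
    topic_term = [[0] * vocab_size for _ in range(num_topics)]  # counts n_kw
    topic_total = [0] * num_topics                            # counts n_k
    assign = []                                               # topic of each token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_term[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_term[z][w] -= 1
                topic_total[z] -= 1
                # Full conditional of the topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_term[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_term[z][w] += 1
                topic_total[z] += 1

    # Point estimates from the final state of the chain.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta)
         for c in topic_term[k]]
        for k in range(num_topics)
    ]
    return theta, phi


# Illustrative corpus: four short documents over a six-term vocabulary.
docs = [[0, 1, 0, 1, 2], [0, 0, 1, 2, 1], [3, 4, 5, 3, 4], [4, 5, 5, 3, 3]]
theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=6)
```

Each row of `theta` is a probability vector over topics for one document, so comparing rows gives one simple notion of the document similarity mentioned above. Each row of `phi` is a probability vector over terms for one topic.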
