Finding Hotspots in Document Collection

Given a document collection, it is often desirable to find the core subset of documents that focus on a specific topic. Document clustering methods partition a document-term dataset into groups by optimizing an objective function. However, they are not well suited to finding hotspots, which are described by a small set of documents sharing a few tightly coupled terms. In this paper we propose a novel hotspot-finding algorithm for document collections, DCC (Dense Concept Clustering). DCC extracts distinct small topics together with their most representative documents and words simultaneously. The hotspots are dense bicliques in the binary document-word matrix, and they are discovered sequentially, one at a time, using the generalized Motzkin-Straus formalism. Within each hotspot, the representative documents and words are tightly correlated, so together they describe a concept. Experiments on real document datasets show the effectiveness of the proposed algorithm.
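To make the idea of sequentially extracting dense bicliques concrete, the following is a minimal sketch and not the DCC algorithm itself, whose update rules are not given here. It assumes a relaxation in the spirit of the generalized Motzkin-Straus formalism: maximize x^T A y over non-negative document weights x and word weights y, subject to sum_i x_i^beta = 1 and sum_j y_j^beta = 1 with 1 < beta <= 2. The alternating exact-maximization step, the thresholding rule, the zeroing of a found block before the next round, and all names and parameters (`find_hotspot`, `find_hotspots`, `beta`, `keep`) are illustrative assumptions.

```python
def lbeta_normalize(v, beta):
    """Scale a non-negative vector so that sum(v_i ** beta) == 1."""
    norm = sum(t ** beta for t in v) ** (1.0 / beta)
    return [t / norm for t in v]


def best_weights(scores, beta):
    """Maximizer of sum(w_i * scores_i) subject to sum(w_i ** beta) == 1, w >= 0.

    By Hoelder's inequality the optimum is w_i proportional to
    scores_i ** (1 / (beta - 1)). Returns None if all scores are zero.
    """
    top = max(scores)
    if top <= 0:
        return None
    p = 1.0 / (beta - 1.0)
    # Divide by the largest score first so the power cannot overflow.
    return lbeta_normalize([(s / top) ** p for s in scores], beta)


def find_hotspot(A, beta=1.2, iters=50, keep=0.5):
    """Return (docs, words) of one dense block in a binary matrix A.

    Assumed scheme: alternate exact maximization over x and y, then keep
    the indices whose weight is at least `keep` times the largest weight.
    """
    n, m = len(A), len(A[0])
    y = lbeta_normalize([1.0] * m, beta)
    for _ in range(iters):
        x = best_weights(
            [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)], beta)
        if x is None:  # no non-zero entries left
            return [], []
        y = best_weights(
            [sum(A[i][j] * x[i] for i in range(n)) for j in range(m)], beta)
        if y is None:
            return [], []
    docs = [i for i in range(n) if x[i] >= keep * max(x)]
    words = [j for j in range(m) if y[j] >= keep * max(y)]
    return docs, words


def find_hotspots(A, k, **kwargs):
    """Extract up to k blocks one at a time, zeroing each block once found."""
    A = [row[:] for row in A]  # work on a copy
    found = []
    for _ in range(k):
        docs, words = find_hotspot(A, **kwargs)
        if not docs or not words:
            break
        found.append((docs, words))
        for i in docs:
            for j in words:
                A[i][j] = 0  # assumed masking step before the next round
    return found


# Toy document-word matrix (rows: documents, columns: words).
toy = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
hotspots = find_hotspots(toy, k=2)
```

On this toy matrix the first round isolates the 3-by-3 block of documents 0 to 2 and words 0 to 2, and the second round the 2-by-2 block of documents 3 and 4 and words 3 and 4. Values of beta close to 1 push the weights toward a sparse, clique-like support, while beta = 2 reduces the alternating step to ordinary power iteration for the leading singular vectors.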
