Effect of Dimensionality Reduction on Different Distance Measures in Document Clustering

In document clustering, semantically similar documents are grouped together. The dimensionality of document collections is often very large: thousands or tens of thousands of terms. It is therefore common to reduce the original dimensionality before clustering for computational reasons. Cosine distance is widely seen as the best choice for measuring the distances between documents in k-means clustering. In this paper, we experiment with three dimensionality reduction methods combined with a selection of distance measures, and show that after reduction to small target dimensionalities, such as 10 or below, the superiority of the cosine measure no longer holds. Also, for small dimensionalities, the PCA dimensionality reduction method performs better than SVD. We also show how l2 normalization affects different distance measures. The experiments are run on three document sets in English and one in Hindi.
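One reason l2 normalization changes how distance measures compare is a standard identity: for unit-length vectors, the squared Euclidean distance equals twice the cosine distance, while the cosine distance itself is unchanged by normalization. The following is a minimal sketch of that identity on random toy vectors. It is not the paper's experimental setup, and the helper names (`l2_normalize`, `cosine_distance`, `squared_euclidean`) as well as the data are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(X, eps=1e-12):
    """Scale each row of X to unit Euclidean length (hypothetical helper)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

def cosine_distance(a, b):
    """Cosine distance: 1 - cos(angle between a and b)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def squared_euclidean(a, b):
    """Squared Euclidean distance between a and b."""
    d = a - b
    return float(np.dot(d, d))

# Toy stand-in for dimensionality-reduced document vectors
# (5 documents, 10 dimensions); not data from the paper.
rng = np.random.default_rng(0)
X = rng.random((5, 10))
Xn = l2_normalize(X)

# Cosine distance is invariant to l2 normalization.
cos_raw = cosine_distance(X[0], X[1])
cos_norm = cosine_distance(Xn[0], Xn[1])

# After normalization: ||a - b||^2 = 2 * (1 - cos(a, b)).
gap = abs(squared_euclidean(Xn[0], Xn[1]) - 2.0 * cos_norm)
```

Because of this identity, Euclidean and cosine distances induce the same nearest-neighbour ordering on normalized vectors, so differences between them can only show up when the vectors are left unnormalized.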
