K-Subspace Clustering

The widely used K-means clustering algorithm deals with ball-shaped (spherical Gaussian) clusters. In this paper, we extend K-means clustering to accommodate extended clusters in subspaces, such as line-shaped clusters, plane-shaped clusters, and ball-shaped clusters. The algorithm retains much of the K-means flavor: it is easy to implement and fast to converge. A model selection procedure is incorporated to determine the cluster shape. As a result, our algorithm can recognize a wide range of subspace clusters studied in the literature, as well as global ball-shaped clusters (living in all dimensions). We carry out extensive experiments on both synthetic and real-world datasets, and the results demonstrate the effectiveness of our algorithm.
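To make the idea of a line-shaped cluster concrete, the following is a minimal sketch of a K-means-style loop in which every cluster is modeled as a line rather than a ball. It is an illustration under stated assumptions, not the algorithm of the paper: the names `fit_line`, `line_distance_sq`, and `k_lines` are hypothetical, the line is fitted with a plain SVD, initialization is random, and the model selection step that chooses between line, plane, and ball shapes is omitted.

```python
import numpy as np

def fit_line(points):
    """Return (centroid, unit direction) of the least-squares line through points."""
    c = points.mean(axis=0)
    # The first right-singular vector of the centered data is the
    # direction of largest variance, i.e. the best-fit line direction.
    _, _, vt = np.linalg.svd(points - c, full_matrices=False)
    return c, vt[0]

def line_distance_sq(X, c, a):
    """Squared perpendicular distance from each row of X to the line c + t*a."""
    d = X - c
    proj = d @ a  # signed length of each point's component along the line
    # Pythagoras: squared norm minus squared projection; clip rounding noise.
    return np.maximum(np.einsum("ij,ij->i", d, d) - proj**2, 0.0)

def k_lines(X, k, n_iter=50, seed=0):
    """Alternate between fitting one line per cluster and reassigning points.

    Assumption: random initial labels and a fixed iteration cap; the result
    is a local optimum and depends on the seed.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))
    models = []
    for _ in range(n_iter):
        models = []
        for j in range(k):
            pts = X[labels == j]
            if len(pts) < 2:
                # Degenerate cluster: re-seed it from two random data points.
                pts = X[rng.choice(len(X), size=2, replace=False)]
            models.append(fit_line(pts))
        # Assignment step: each point goes to its nearest line.
        dist = np.stack([line_distance_sq(X, c, a) for c, a in models], axis=1)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, models
```

Replacing the perpendicular distance with the ordinary Euclidean distance to the centroid recovers standard K-means, which is why a loop of this form keeps the same two-step structure of model update followed by reassignment.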
