Subspace clustering for high dimensional data: a review

Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. In high-dimensional data, many dimensions are often irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant and redundant dimensions by analyzing the entire dataset. Subspace clustering algorithms instead localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, possibly overlapping subspaces. There are two major branches of subspace clustering, distinguished by their search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches find dense regions in low-dimensional spaces and combine them to form clusters. This paper presents a survey of the various subspace clustering algorithms along with a hierarchy organizing the algorithms by their defining characteristics. We then compare the two main approaches to subspace clustering using empirical scalability and accuracy tests and discuss some potential applications where subspace clustering could be particularly useful.
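
The bottom-up strategy described above can be illustrated with a minimal sketch. It is not taken from any of the surveyed algorithms; it assumes data scaled to [0, 1), a fixed equal-width grid, and a simple count threshold, and the names `grid_cell`, `dense_units`, `bins`, `min_count`, and `max_dims` are hypothetical. Dense intervals are found in each single dimension first, and a higher-dimensional subspace is examined only if all of its lower-dimensional projections already contain dense cells.

```python
from collections import Counter
from itertools import combinations


def grid_cell(point, dims, bins):
    """Map a point in [0, 1)^d to its grid cell restricted to the given dims."""
    return tuple(min(int(point[d] * bins), bins - 1) for d in dims)


def dense_units(points, bins=4, min_count=3, max_dims=2):
    """Return {dims: set of dense cells} for subspaces of up to max_dims dimensions.

    Assumption: a cell is "dense" if it holds at least min_count points.
    """
    n_dims = len(points[0])
    result = {}

    # Level 1: dense intervals in each single dimension.
    for d in range(n_dims):
        counts = Counter(grid_cell(p, (d,), bins) for p in points)
        cells = {c for c, n in counts.items() if n >= min_count}
        if cells:
            result[(d,)] = cells

    # Higher levels: skip any subspace that has a projection without dense
    # cells, since a dense cell implies dense projections.
    for k in range(2, max_dims + 1):
        for dims in combinations(range(n_dims), k):
            projections = [dims[:i] + dims[i + 1:] for i in range(k)]
            if not all(proj in result for proj in projections):
                continue
            counts = Counter(grid_cell(p, dims, bins) for p in points)
            cells = {c for c, n in counts.items() if n >= min_count}
            if cells:
                result[dims] = cells
    return result


# Toy data: four points share low values in dimensions 0 and 1, while
# dimension 2 is spread out; the last two points are noise.
points = [
    (0.10, 0.12, 0.05),
    (0.12, 0.10, 0.30),
    (0.11, 0.11, 0.55),
    (0.13, 0.12, 0.80),
    (0.60, 0.90, 0.10),
    (0.90, 0.40, 0.60),
]
units = dense_units(points)
print(units)
```

On this toy data, dimension 2 contains no dense interval, so every subspace that includes it is skipped, and the only two-dimensional dense cell lies in the subspace spanned by dimensions 0 and 1. A full bottom-up algorithm would go on to merge adjacent dense cells into clusters, which this sketch leaves out.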
