Subspace clustering

Subspace clustering refers to the task of identifying clusters of similar objects or data records (vectors), where similarity is defined with respect to a subset of the attributes (i.e., a subspace of the data space). The subspace need not be, and usually is not, the same for different clusters within one clustering solution. This article sketches the problems motivating subspace clustering, describes different definitions and usages of subspaces for clustering, and discusses exemplary algorithmic solutions. Finally, we sketch current research directions. © 2012 Wiley Periodicals, Inc.
