Automatic subspace clustering of high dimensional data for data mining applications

Data mining applications place special requirements on clustering algorithms, including the ability to find clusters embedded in subspaces of high-dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for the data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large, high-dimensional datasets.
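The abstract does not spell out how dense regions in subspaces are found, so the following is only a minimal sketch of one way a grid-based, bottom-up search could work. It is not the authors' implementation. It assumes the data are scaled to [0, 1), that each dimension is cut into `xi` equal-width intervals, and that a grid cell counts as dense when it holds more than a fraction `tau` of the points. The names `dense_units`, `xi`, and `tau` are hypothetical, chosen for illustration. The sketch stops at listing dense cells per subspace; merging cells into clusters and producing minimized DNF descriptions are left out.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, xi=4, tau=0.2):
    """Map each subspace (a sorted tuple of dimension indices) to its dense cells.

    Assumptions (not taken from the abstract): coordinates lie in [0, 1),
    each axis is split into xi equal-width intervals, and a cell is dense
    when it contains more than a fraction tau of all points.
    """
    n, d = len(points), len(points[0])
    # Interval index of every coordinate of every point.
    cells = [tuple(min(int(x * xi), xi - 1) for x in p) for p in points]

    def dense_in(dims):
        # Count points per cell after projecting onto the given dimensions.
        counts = Counter(tuple(cell[i] for i in dims) for cell in cells)
        return {unit for unit, k in counts.items() if k / n > tau}

    # Start with one-dimensional subspaces that contain a dense interval.
    level = {}
    for i in range(d):
        found = dense_in((i,))
        if found:
            level[(i,)] = found

    result = {}
    while level:
        result.update(level)
        next_level = {}
        # Join two k-dimensional subspaces that share their first k-1 dims.
        for a, b in combinations(sorted(level), 2):
            if a[:-1] != b[:-1]:
                continue
            dims = a + (b[-1],)
            # A dense cell projects to dense cells, so every k-dimensional
            # sub-subspace must already have produced dense cells.
            if all(s in level for s in combinations(dims, len(dims) - 1)):
                found = dense_in(dims)
                if found:
                    next_level[dims] = found
        level = next_level
    return result


if __name__ == "__main__":
    # Points concentrated in dimensions 0 and 1, spread evenly along dimension 2.
    pts = [(0.1, 0.1, i / 8) for i in range(8)]
    for dims, units in sorted(dense_units(pts, xi=4, tau=0.3).items()):
        print(dims, sorted(units))
```

Because the cell counts depend only on which points fall in which cell, the result of this sketch does not depend on the order of the input records, which matches the order-insensitivity requirement stated above.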
