Dynamic cluster formation using level set methods

Density-based clustering has two advantages: 1) it allows clusters of arbitrary shape, and 2) it does not require the number of clusters as input. However, when clusters touch each other, both the cluster centers and the cluster boundaries (the peaks and valleys of the density distribution) become fuzzy and difficult to determine. We introduce the notion of a cluster intensity function (CIF), which captures the important characteristics of clusters. When clusters are well separated, CIFs are similar to density functions. But when clusters become close to each other, CIFs still clearly reveal cluster centers, cluster boundaries, and the degree of membership of each data point in the cluster to which it belongs. Clustering through bump hunting and valley seeking based on these functions is more robust than clustering based on density functions obtained by kernel density estimation, which are often oscillatory or oversmoothed. These problems of kernel density estimation are resolved using level set methods and related techniques. Comparisons with two existing density-based methods, valley seeking and DBSCAN, illustrate the advantages of our approach.
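
The sensitivity of kernel density estimation mentioned above can be illustrated with a minimal sketch. It is not the CIF or the level set construction, which the abstract does not spell out. It only counts the bumps that a plain Gaussian kernel density estimate produces on two nearby one-dimensional clusters. The synthetic data, the evaluation grid, the three bandwidths, and the helper names `gaussian_kde_1d` and `count_modes` are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    # Pairwise scaled distances, shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (len(samples) * bandwidth * np.sqrt(2.0 * np.pi))


def count_modes(density):
    """Count strict interior local maxima (bumps) of a sampled density."""
    interior = density[1:-1]
    is_peak = (interior > density[:-2]) & (interior > density[2:])
    return int(np.sum(is_peak))


# Hypothetical data: two clusters of 50 evenly spaced points each,
# separated by a gap of width 1 (an assumption, not data from the paper).
samples = np.concatenate([
    np.linspace(-1.5, -0.5, 50),
    np.linspace(0.5, 1.5, 50),
])
grid = np.linspace(-4.0, 4.0, 8001)

# Illustrative bandwidths: very small, moderate, and very large.
for bandwidth in (0.005, 0.3, 3.0):
    density = gaussian_kde_1d(samples, grid, bandwidth)
    print(f"bandwidth={bandwidth}: {count_modes(density)} modes")
```

On this toy data the very small bandwidth gives an oscillatory estimate with many spurious bumps, the moderate one recovers the two clusters, and the very large one merges them into a single bump. Bump hunting or valley seeking applied directly to such an estimate therefore depends heavily on the bandwidth choice.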
