Dynamic Cluster Formation Using Level Set Methods

Density-based clustering has the advantages of (i) allowing clusters of arbitrary shape and (ii) not requiring the number of clusters as input. However, when clusters touch each other, both the cluster centers and the cluster boundaries (the peaks and valleys of the density distribution) become fuzzy and difficult to determine. In higher dimensions, the boundaries become wiggly and over-fitting often occurs. We introduce the notion of a cluster intensity function (CIF), which captures the important characteristics of clusters. When clusters are well separated, CIFs are similar to density functions. But when clusters touch each other, CIFs still clearly reveal the cluster centers, the cluster boundaries, and the degree of membership of each data point in the cluster to which it belongs. Clustering through bump hunting and valley seeking based on these functions is more robust than clustering based on kernel density functions, which are often oscillatory or over-smoothed. These problems of kernel density estimation are resolved using Level Set Methods and related techniques. Comparisons with two existing density-based methods, valley seeking and DBSCAN, are presented to illustrate the advantages of our approach.
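
The sensitivity of kernel density estimates mentioned above can be illustrated with a minimal sketch. It does not implement the CIF or the level set procedure; it only shows, under assumed settings, how the number of density peaks found by bump hunting depends on the kernel bandwidth. The two touching Gaussian clusters, the bandwidth values 0.05 and 2.0, and the helper names `gaussian_kde_1d` and `count_modes` are all hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a 1-D Gaussian kernel density estimate on the grid points."""
    # Pairwise standardized distances, shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = samples.size * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


def count_modes(density):
    """Count interior local maxima (candidate cluster centers) of a sampled density."""
    inner = density[1:-1]
    is_peak = (inner > density[:-2]) & (inner >= density[2:])
    return int(np.sum(is_peak))


# Assumed toy data: two touching clusters with unit spread, centered at -1 and +1.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-1.0, 1.0, 100), rng.normal(1.0, 1.0, 100)])
grid = np.linspace(-6.0, 6.0, 1201)

# A small bandwidth gives an oscillatory estimate with many spurious peaks.
modes_rough = count_modes(gaussian_kde_1d(samples, grid, bandwidth=0.05))
# A large bandwidth over-smooths and merges the two clusters into one peak.
modes_smooth = count_modes(gaussian_kde_1d(samples, grid, bandwidth=2.0))

print("peaks with bandwidth 0.05:", modes_rough)
print("peaks with bandwidth 2.0:", modes_smooth)
```

In this toy setting neither bandwidth recovers the two underlying clusters by peak counting alone, which is the kind of difficulty the abstract attributes to kernel density functions.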
