Similarity clustering in the presence of outliers: Exact recovery via convex program

We study the problem of clustering a set of data points based on their similarity matrix, each entry of which represents the similarity between the corresponding pair of points. We propose a convex-optimization-based clustering algorithm that operates directly on the similarity matrix and comes with provable recovery guarantees. The algorithm requires no prior knowledge of the number of clusters and is robust to outliers and noise. Using a generative stochastic model for the similarity matrix, which can be viewed as a generalization of the classical Stochastic Block Model, we obtain precise (not merely order-wise) bounds on the cluster sizes, the number of outliers, the noise variance, the separation between the mean similarities inside and outside the clusters, and the range of the regularization parameter that together guarantee exact recovery of the clusters with high probability. The theoretical findings are corroborated by extensive simulations.
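The abstract does not reproduce the paper's specific convex program, so the following is only a minimal sketch of the setting it describes: a similarity matrix drawn from a generalized stochastic-block-model-style generative model with planted clusters, outliers, and additive noise, followed by a low-rank-plus-sparse convex relaxation in the spirit of the related convex clustering literature. The cluster sizes, mean similarities, noise level, regularization parameter `lam`, and recovery threshold below are illustrative assumptions, not values taken from the paper.

```python
# Sketch only: a low-rank + sparse convex relaxation for similarity clustering
# with outliers and noise. All parameters are illustrative assumptions.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)

# --- Generative model: two planted clusters plus outliers (assumed sizes) ---
sizes = [30, 20]                       # assumed cluster sizes
n_out = 10                             # assumed number of outliers
n = sum(sizes) + n_out
p_in, p_out, sigma = 0.8, 0.3, 0.05    # assumed mean similarities and noise std

A = p_out + sigma * rng.standard_normal((n, n))      # baseline similarities
start = 0
for s in sizes:                                       # planted cluster blocks
    A[start:start + s, start:start + s] = p_in + sigma * rng.standard_normal((s, s))
    start += s
A = np.clip((A + A.T) / 2.0, 0.0, 1.0)                # symmetrize, keep in [0, 1]
np.fill_diagonal(A, 1.0)

# --- Convex program: split A into a (ideally block-constant) low-rank part L
# --- and a sparse/noise part S, with an illustrative regularization weight ---
L = cp.Variable((n, n), symmetric=True)
S = cp.Variable((n, n), symmetric=True)
lam = 1.0 / np.sqrt(n)                                # illustrative choice

objective = cp.Minimize(cp.norm(L, "nuc") + lam * cp.sum(cp.abs(S)))
constraints = [L + S == A, L >= 0, L <= 1]
cp.Problem(objective, constraints).solve(solver=cp.SCS)

# --- Read off clusters by thresholding the recovered low-rank matrix -------
adj = np.asarray(L.value) > 0.5                       # assumed threshold
print("Recovered block structure (first rows):")
print(adj[:5, :12].astype(int))
```

Under this sketch, the recovered matrix `L` is ideally block-constant, so thresholding it and taking connected components yields the clusters, with outliers appearing as rows whose off-diagonal entries are driven to zero; how faithfully this mirrors the paper's actual program and guarantees is not established by the abstract alone.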
