Parallel Correlation Clustering on Big Graphs

Given a similarity graph between items, correlation clustering (CC) groups similar items together and separates dissimilar ones. One of the most popular CC algorithms is KwikCluster, which serially clusters neighborhoods of vertices and obtains a 3-approximation ratio. In practice, however, KwikCluster requires a large number of clustering rounds, a potential bottleneck for large graphs. We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds and provably achieve nearly linear speedups. C4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a 3-approximation ratio. ClusterWild! is a coordination-free algorithm that abandons consistency in exchange for better scaling; this leads to a provably small loss in the 3-approximation ratio. We provide extensive experimental results for both algorithms, in which they outperform the state of the art in terms of both clustering accuracy and running time. We show that our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15x speedup.
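
To make the serial baseline concrete, the following is a minimal sketch of a KwikCluster-style pivot procedure, not the authors' implementation and not C4 or ClusterWild!. It assumes an unweighted, undirected graph stored as a symmetric adjacency dictionary without self-loops, where an edge means "similar" and a missing edge means "dissimilar". The names `kwik_cluster` and `disagreements` are hypothetical, chosen only for illustration.

```python
import random
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # assumed: symmetric, no self-loops


def kwik_cluster(adj: Graph, seed: int = 0) -> List[Set[Hashable]]:
    """Serial pivot clustering: visit vertices in random order; each
    still-unclustered vertex becomes a pivot and absorbs its
    still-unclustered neighbors."""
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)  # random pivot order
    clustered: Set[Hashable] = set()
    clusters: List[Set[Hashable]] = []
    for pivot in order:
        if pivot in clustered:
            continue  # already absorbed by an earlier pivot
        cluster = {pivot} | {u for u in adj[pivot] if u not in clustered}
        clustered |= cluster
        clusters.append(cluster)
    return clusters


def disagreements(adj: Graph, clusters: List[Set[Hashable]]) -> int:
    """Count similar pairs split across clusters plus dissimilar pairs
    placed in the same cluster."""
    label = {v: i for i, c in enumerate(clusters) for v in c}
    # Each undirected edge is seen twice in the adjacency sets.
    cut_edges = sum(1 for v in adj for u in adj[v] if label[u] != label[v]) // 2
    kept_edges = sum(1 for v in adj for u in adj[v] if label[u] == label[v]) // 2
    pairs_inside = sum(len(c) * (len(c) - 1) // 2 for c in clusters)
    return cut_edges + (pairs_inside - kept_edges)


if __name__ == "__main__":
    # Two disjoint triangles: any pivot order recovers both exactly.
    g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
    cs = kwik_cluster(g, seed=1)
    print(cs, disagreements(g, cs))
```

Each iteration of the loop depends on which vertices earlier pivots have already claimed, which is the serial dependency that the parallel algorithms in the abstract are designed to relax or coordinate.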
