Optimistic Concurrency Control for Distributed Unsupervised Learning

Research on distributed machine learning algorithms has focused primarily on one of two extremes: algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and, if conflicts do arise, invoke a conflict-resolution protocol. We view this "optimistic concurrency control" paradigm as especially well suited to large-scale machine learning, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning, and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment.
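
To make the optimistic pattern concrete, the sketch below shows one epoch of a DP-means-style clustering update under optimistic concurrency control. It is an illustrative simulation only (the parallel phase is run serially), not the paper's exact algorithm; the function name occ_clustering_epoch, the distance threshold rule, and the num_workers parameter are our own assumptions. Workers scan their shards against a stale snapshot of the centers and optimistically propose new centers; a short serial validation pass then resolves any conflicts among the proposals.

    import numpy as np

    def occ_clustering_epoch(points, centers, threshold, num_workers=4):
        """One optimistic epoch of a DP-means-style clustering update (sketch).

        Parallel phase: each worker scans its shard against a stale snapshot of
        `centers` and optimistically proposes a new center for any point farther
        than `threshold` from every snapshot center.
        Serial validation phase: proposals are replayed in order and accepted
        only if they remain far from all centers accepted so far; a rejected
        proposal is a "conflict" whose point is simply covered by an already
        accepted center.
        """
        snapshot = list(centers)
        shards = np.array_split(points, num_workers)

        # Parallel phase (simulated serially here): collect optimistic proposals.
        proposals = []
        for shard in shards:
            for x in shard:
                far_from_all = (not snapshot or
                                np.min(np.linalg.norm(np.asarray(snapshot) - x, axis=1)) > threshold)
                if far_from_all:
                    proposals.append(x)

        # Serial validation phase: resolve conflicts among the proposals.
        accepted = list(centers)
        for x in proposals:
            if (not accepted or
                    np.min(np.linalg.norm(np.asarray(accepted) - x, axis=1)) > threshold):
                accepted.append(x)   # proposal survives validation
            # else: conflict; a newly accepted center already covers this point

        return accepted

For example, calling occ_clustering_epoch(np.random.randn(1000, 2), [], threshold=2.0) returns the centers that survive one optimistic pass; because validation is a short serial step over proposals only, the expensive distance computations stay in the parallel phase.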
