Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms

We study parallel and distributed Frank-Wolfe algorithms; the former on shared memory machines with mini-batching, and the latter in a delayed update framework. In both cases, we perform computations asynchronously whenever possible. We assume block-separable constraints as in Block-Coordinate Frank-Wolfe (BCFW) method (Lacoste-Julien et al., 2013), but our analysis subsumes BCFW and reveals problemdependent quantities that govern the speedups of our methods over BCFW. A notable feature of our algorithms is that they do not depend on worst-case bounded delays, but only (mildly) on expected delays, making them robust to stragglers and faulty worker threads. We present experiments on structural SVM and Group Fused Lasso, and observe significant speedups over competing state-of-the-art (and synchronous) methods.

[1]  Philip Wolfe,et al.  An algorithm for quadratic programming , 1956 .

[2]  Larry J. LeBlanc,et al.  AN EFFICIENT APPROACH TO SOLVING THE ROAD NETWORK EQUILIBRIUM TRAFFIC ASSIGNMENT PROBLEM. IN: THE AUTOMOBILE , 1975 .

[3]  John N. Tsitsiklis,et al.  Distributed Asynchronous Deterministic and Stochastic Gradient Optimization Algorithms , 1984, 1984 American Control Conference.

[4]  Martin Raab,et al.  "Balls into Bins" - A Simple and Tight Analysis , 1998, RANDOM.

[5]  Michael Mitzenmacher,et al.  The Power of Two Choices in Randomized Load Balancing , 2001, IEEE Trans. Parallel Distributed Syst..

[6]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[7]  Kenneth L. Clarkson,et al.  Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm , 2008, SODA '08.

[8]  Peter L. Bartlett,et al.  Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks , 2008, J. Mach. Learn. Res..

[9]  Peng Sun,et al.  Linear convergence of a modified Frank–Wolfe algorithm for computing minimum-volume enclosing ellipsoids , 2008, Optim. Methods Softw..

[10]  Dirk A. Lorenz,et al.  A generalized conditional gradient method and its connection to an iterative shrinkage method , 2009, Comput. Optim. Appl..

[11]  Jieping Ye,et al.  Tensor Completion for Estimating Missing Values in Visual Data , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Thorsten Joachims,et al.  Learning structural SVMs with latent variables , 2009, ICML '09.

[13]  S. Fujishige,et al.  A Submodular Function Minimization Algorithm Based on the Minimum-Norm Base ⁄ , 2009 .

[14]  Alexander G. Gray,et al.  Fast Stochastic Frank-Wolfe Algorithms for Nonlinear SVMs , 2010, SDM.

[15]  Stephen J. Wright,et al.  Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent , 2011, NIPS.

[16]  Gary L. Miller,et al.  A Nearly-m log n Time Solver for SDD Linear Systems , 2011, 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.

[17]  Martin Jaggi,et al.  Sparse Convex Optimization Methods for Machine Learning , 2011 .

[18]  Jean-Philippe Vert,et al.  The group fused Lasso for multiple change-point detection , 2011, 1106.4199.

[19]  Yurii Nesterov,et al.  Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems , 2012, SIAM J. Optim..

[20]  Elad Hazan,et al.  Projection-free Online Learning , 2012, ICML.

[21]  John D. Lafferty,et al.  Nonparametric Reduced Rank Regression , 2012, NIPS.

[22]  Yaoliang Yu,et al.  Accelerated Training for Matrix-norm Regularization: A Boosting Approach , 2012, NIPS.

[23]  Amir Beck,et al.  On the Convergence of Block Coordinate Descent Type Methods , 2013, SIAM J. Optim..

[24]  Seunghak Lee,et al.  More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server , 2013, NIPS.

[25]  Zeyuan Allen Zhu,et al.  A simple, combinatorial algorithm for solving SDD systems in nearly-linear time , 2013, STOC '13.

[26]  Aaron Q. Li,et al.  Parameter Server for Distributed Machine Learning , 2013 .

[27]  José R. Dorronsoro,et al.  Group Fused Lasso , 2013, ICANN.

[28]  Seunghak Lee,et al.  Petuum: A Framework for Iterative-Convergent Distributed ML , 2013, ArXiv.

[29]  Martin Jaggi,et al.  An Affine Invariant Linear Convergence Analysis for Frank-Wolfe Algorithms , 2013, 1312.7864.

[30]  Martin Jaggi,et al.  Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization , 2013, ICML.

[31]  Mark W. Schmidt,et al.  Block-Coordinate Frank-Wolfe Optimization for Structural SVMs , 2012, ICML.

[32]  Elad Hazan,et al.  A Linearly Convergent Conditional Gradient Algorithm with Applications to Online and Stochastic Optimization , 2013, 1301.4666.

[33]  Suvrit Sra,et al.  Reflection methods for user-friendly submodular optimization , 2013, NIPS.

[34]  Yaoliang Yu,et al.  Polar Operators for Structured Sparse Estimation , 2013, NIPS.

[35]  Maria-Florina Balcan,et al.  Distributed Frank-Wolfe Algorithm: A Unified Framework for Communication-Efficient Sparse Learning , 2014, ArXiv.

[36]  Seunghak Lee,et al.  On Model Parallelization and Scheduling Strategies for Distributed Machine Learning , 2014, NIPS.

[37]  Yaoliang Yu,et al.  Petuum: A New Platform for Distributed Machine Learning on Big Data , 2013, IEEE Transactions on Big Data.

[38]  Eric Moulines,et al.  Convergence Analysis of a Stochastic Projection-free Algorithm , 2015, ArXiv.

[39]  Stephen J. Wright,et al.  An asynchronous parallel stochastic coordinate descent algorithm , 2013, J. Mach. Learn. Res..

[40]  Maria-Florina Balcan,et al.  A Distributed Frank-Wolfe Algorithm for Communication-Efficient Sparse Learning , 2014, SDM.

[41]  Peter Richtárik,et al.  Accelerated, Parallel, and Proximal Coordinate Descent , 2013, SIAM J. Optim..

[42]  Martin Jaggi,et al.  On the Global Linear Convergence of Frank-Wolfe Optimization Variants , 2015, NIPS.

[43]  Eric P. Xing,et al.  High-Performance Distributed ML at Scale through Parameter Server Consistency Models , 2014, AAAI.

[44]  Zaïd Harchaoui,et al.  Conditional gradient algorithms for norm-regularized smooth convex optimization , 2013, Math. Program..

[45]  Paul Grigas,et al.  New analysis and results for the Frank–Wolfe method , 2013, Math. Program..

[46]  Peter Richtárik,et al.  Coordinate descent with arbitrary sampling II: expected separable overapproximation , 2014, Optim. Methods Softw..

[47]  Peter Richtárik,et al.  Parallel coordinate descent methods for big data optimization , 2012, Mathematical Programming.