Gradient Diversity Empowers Distributed Learning: Convergence and Stability of Mini-batch SGD

It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) exhibit speedup saturation and decaying generalization ability beyond a particular batch size. In this work, we present an analysis suggesting that high similarity between concurrently processed gradients may be a cause of this performance degradation. We introduce the notion of gradient diversity, which measures the dissimilarity between concurrent gradient updates, and show its key role in the performance of mini-batch SGD. We prove that on problems with high gradient diversity, mini-batch SGD admits better speedups while maintaining the generalization performance of serial (one-sample) SGD. We further establish convergence lower bounds showing that mini-batch SGD slows down beyond a particular batch size solely due to a lack of gradient diversity. Finally, we provide experimental evidence indicating the key role of gradient diversity in distributed learning, and discuss how heuristics such as dropout, Langevin dynamics, and quantization can improve it.
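The quantity at the center of these results can be made concrete with a short sketch. The snippet below is a minimal NumPy illustration, not code from the paper: it assumes gradient diversity is the ratio of the sum of squared per-example gradient norms to the squared norm of their sum, consistent with the dissimilarity measure described above; the function name and toy batches are purely illustrative.

```python
import numpy as np

def gradient_diversity(per_example_grads):
    """Ratio of the sum of squared per-example gradient norms to the
    squared norm of the summed gradient. By Cauchy-Schwarz the minimum
    value 1/n is attained by identical gradients; mutually orthogonal
    gradients give exactly 1."""
    grads = np.atleast_2d(np.asarray(per_example_grads, dtype=float))  # shape (n, d)
    sum_of_squared_norms = np.sum(grads ** 2)            # sum_i ||g_i||^2
    squared_norm_of_sum = np.sum(grads.sum(axis=0) ** 2)  # ||sum_i g_i||^2
    return sum_of_squared_norms / squared_norm_of_sum

# Identical gradients: diversity hits its minimum 1/n, so averaging a
# large mini-batch adds little information per update.
identical = np.tile([1.0, 0.0, 0.0, 0.0], (8, 1))
print(gradient_diversity(identical))   # 0.125

# Mutually orthogonal gradients: diversity equals 1, so a large
# mini-batch retains the per-sample progress of serial SGD.
orthogonal = np.eye(8)
print(gradient_diversity(orthogonal))  # 1.0
```

Under this reading, a batch whose gradients all point in nearly the same direction has low diversity, which is exactly the regime where, by the results summarized above, increasing the batch size stops paying off.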
