Variance Reduction with Sparse Gradients

Variance reduction methods such as SVRG and SpiderBoost combine large-batch and small-batch gradients to reduce the variance of stochastic gradients. Compared to SGD, these methods require at least twice as many operations per model-parameter update. To reduce this computational cost, we introduce a new sparsity operator: the random-top-k operator. The operator lowers computational complexity by exploiting the gradient sparsity observed in a variety of applications; it combines the top-k operator with the randomized coordinate descent operator. With this operator, large-batch gradients offer an extra benefit beyond variance reduction: a reliable estimate of gradient sparsity. Theoretically, our algorithm is at least as good as the best existing algorithm (SpiderBoost), and it further excels whenever the random-top-k operator captures gradient sparsity. Empirically, our algorithm consistently outperforms SpiderBoost across a range of models and tasks, including image classification, natural language processing, and sparse matrix factorization. We also provide empirical support for the intuition behind our algorithm via a simple gradient-entropy computation that quantifies gradient sparsity at every iteration.
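To make the idea concrete, below is a minimal sketch of one plausible form of a random-top-k sparsifier, written in NumPy. The split between a top-k block (chosen by the magnitude of a reference large-batch gradient) and a uniformly sampled random block, as well as the rescaling of the random block, are assumptions for illustration; the names `random_top_k`, `k_top`, and `k_rand` are hypothetical and the paper's exact construction may differ.

```python
import numpy as np

def random_top_k(grad, ref_grad, k_top, k_rand, rng=None):
    """Hypothetical sketch of a random-top-k sparsification operator.

    Keeps the k_top coordinates with the largest magnitude in a reference
    (large-batch) gradient, plus k_rand coordinates drawn uniformly at
    random from the remaining ones; the random block is rescaled so that
    its contribution is unbiased in expectation.
    """
    rng = rng or np.random.default_rng()
    d = grad.shape[0]

    # Coordinates with the largest reference-gradient magnitude (top-k part).
    top_idx = np.argpartition(np.abs(ref_grad), -k_top)[-k_top:]

    # Coordinates sampled uniformly from the remainder (random part).
    rest = np.setdiff1d(np.arange(d), top_idx)
    rand_idx = rng.choice(rest, size=k_rand, replace=False)

    out = np.zeros_like(grad)
    out[top_idx] = grad[top_idx]
    # Rescale the random block so its expectation matches the dense gradient
    # on the non-top-k coordinates.
    out[rand_idx] = grad[rand_idx] * (rest.size / k_rand)
    return out
```

Under this sketch, the large-batch gradient computed at each outer iteration of an SVRG/SpiderBoost-style method would serve as `ref_grad`, so the sparsity pattern comes essentially for free, while small-batch updates only touch the selected coordinates.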
