Random Shuffling Beats SGD after Finite Epochs

A long-standing problem in the theory of stochastic gradient descent (SGD) is to prove that its without-replacement version, RandomShuffle, converges faster than the usual with-replacement version. We present the first (to our knowledge) non-asymptotic solution to this problem, showing that after a "reasonable" number of epochs RandomShuffle indeed converges faster than SGD. Specifically, we prove that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective and T is the total number of iterations. This result shows that after a reasonable number of epochs RandomShuffle is strictly better than SGD (which converges as O(1/T)). The key step toward obtaining this better dependence on T is the introduction of n into the bound; as our analysis shows, a dependence on n is in general unavoidable without further changes to the algorithm. We also show that for sparse data RandomShuffle attains the rate O(1/T^2), again strictly better than SGD. Furthermore, we discuss extensions to nonconvex gradient-dominated functions, as well as non-strongly convex settings.
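To make the algorithmic comparison concrete, below is a minimal sketch (not from the paper) of the two sampling schemes on a finite-sum objective f(x) = (1/n) * sum_i f_i(x). The least-squares components, step size, and function names are illustrative assumptions; the only point is the contrast between sampling with replacement and processing a fresh random permutation each epoch.

```python
import numpy as np

# Illustrative finite-sum objective: f(x) = (1/n) * sum_i f_i(x),
# with f_i(x) = 0.5 * (a_i @ x - b_i)^2 chosen purely for concreteness.
rng = np.random.default_rng(0)
n, d = 100, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(x, i):
    """Gradient of the i-th component f_i at x."""
    return (A[i] @ x - b[i]) * A[i]

def sgd_with_replacement(x, epochs, step):
    """Classical SGD: each update samples a component index uniformly at random."""
    for _ in range(epochs):
        for _ in range(n):
            i = rng.integers(n)          # with replacement
            x = x - step * grad_i(x, i)
    return x

def random_shuffle(x, epochs, step):
    """RandomShuffle: each epoch visits all n components in a fresh random order."""
    for _ in range(epochs):
        for i in rng.permutation(n):     # without replacement within an epoch
            x = x - step * grad_i(x, i)
    return x

x0 = np.zeros(d)
print(np.linalg.norm(A @ sgd_with_replacement(x0, 50, 0.01) - b))
print(np.linalg.norm(A @ random_shuffle(x0, 50, 0.01) - b))
```

With T = (number of epochs) * n total iterations, the bound above says the second routine reaches O(1/T^2 + n^3/T^3) suboptimality under strong convexity and second-order smoothness, whereas with-replacement SGD converges only as O(1/T).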
