Random Shuffling Beats SGD after Finite Epochs

A long-standing problem in the theory of stochastic gradient descent (SGD) is to prove that its without-replacement version, RandomShuffle, converges faster than the usual with-replacement version. We present the first (to our knowledge) non-asymptotic solution to this problem, showing that after a "reasonable" number of epochs RandomShuffle indeed converges faster than SGD. Specifically, we prove that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective and T is the total number of iterations. This result shows that after a reasonable number of epochs RandomShuffle is strictly better than SGD (which converges as O(1/T)). The key step toward obtaining this better dependence on T is the introduction of n into the bound; as our analysis shows, a dependence on n is in general unavoidable without further changes to the algorithm. We also show that for sparse data RandomShuffle attains the rate O(1/T^2), again strictly better than SGD. Furthermore, we discuss extensions to nonconvex gradient-dominated functions, as well as non-strongly convex settings.
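To make the algorithmic comparison concrete, below is a minimal sketch (not from the paper) of the two sampling schemes on a finite-sum objective f(x) = (1/n) * sum_i f_i(x). The least-squares components, step size, and function names are illustrative assumptions; the only point is the contrast between sampling with replacement and processing a fresh random permutation each epoch.

```python
import numpy as np

# Illustrative finite-sum objective: f(x) = (1/n) * sum_i f_i(x),
# with f_i(x) = 0.5 * (a_i @ x - b_i)^2 chosen purely for concreteness.
rng = np.random.default_rng(0)
n, d = 100, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(x, i):
    """Gradient of the i-th component f_i at x."""
    return (A[i] @ x - b[i]) * A[i]

def sgd_with_replacement(x, epochs, step):
    """Classical SGD: each update samples a component index uniformly at random."""
    for _ in range(epochs):
        for _ in range(n):
            i = rng.integers(n)          # with replacement
            x = x - step * grad_i(x, i)
    return x

def random_shuffle(x, epochs, step):
    """RandomShuffle: each epoch visits all n components in a fresh random order."""
    for _ in range(epochs):
        for i in rng.permutation(n):     # without replacement within an epoch
            x = x - step * grad_i(x, i)
    return x

x0 = np.zeros(d)
print(np.linalg.norm(A @ sgd_with_replacement(x0, 50, 0.01) - b))
print(np.linalg.norm(A @ random_shuffle(x0, 50, 0.01) - b))
```

With T = (number of epochs) * n total iterations, the bound above says the second routine reaches O(1/T^2 + n^3/T^3) suboptimality under strong convexity and second-order smoothness, whereas with-replacement SGD converges only as O(1/T).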
