SGD with shuffling: optimal rates without component convexity and large epoch requirements

We study without-replacement SGD for solving finite-sum optimization problems. Specifically, depending on how the indices of the finite sum are shuffled, we consider two algorithms: RandomShuffle (shuffle at the beginning of each epoch) and SingleShuffle (shuffle only once). First, we establish minimax optimal convergence rates for these algorithms up to poly-log factors. Notably, our analysis is general enough to cover gradient-dominated nonconvex costs and, unlike existing optimal convergence results, does not rely on convexity of the individual component functions. Second, assuming convexity of the individual components, we further sharpen the tight convergence results for RandomShuffle by removing two drawbacks common to all prior work: the large number of epochs required for the results to hold, and the extra poly-log factor gap to the lower bound.
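To make the two shuffling schemes concrete, here is a minimal sketch of without-replacement SGD on a finite sum. The function name `shuffled_sgd`, the `grads` list of per-component gradient oracles, and the quadratic usage example are illustrative assumptions, not the paper's implementation; the only point is the difference between reshuffling every epoch (RandomShuffle) and reusing one permutation (SingleShuffle).

```python
import numpy as np

def shuffled_sgd(grads, x0, n_epochs, lr, scheme="RandomShuffle", seed=0):
    """Without-replacement SGD on f(x) = (1/n) * sum_i f_i(x).

    grads  : list of n callables; grads[i](x) returns the gradient of f_i at x
             (hypothetical oracle interface, assumed for this sketch).
    scheme : "RandomShuffle" draws a fresh permutation at every epoch;
             "SingleShuffle" draws one permutation and reuses it in all epochs.
    """
    rng = np.random.default_rng(seed)
    n = len(grads)
    x = np.asarray(x0, dtype=float)
    perm = rng.permutation(n)          # fixed permutation used under SingleShuffle
    for _ in range(n_epochs):
        if scheme == "RandomShuffle":
            perm = rng.permutation(n)  # reshuffle at the start of each epoch
        for i in perm:                 # one full pass over all n components, no replacement
            x = x - lr * grads[i](x)
    return x

# Example (assumed toy problem): n = 100 scalar quadratics f_i(x) = 0.5 * (x - a_i)^2.
a = np.linspace(-1.0, 1.0, 100)
grads = [lambda x, ai=ai: x - ai for ai in a]
x_hat = shuffled_sgd(grads, x0=5.0, n_epochs=50, lr=0.01, scheme="SingleShuffle")
```

Both schemes visit every component exactly once per epoch; they differ only in whether the visiting order is resampled between epochs, which is exactly the distinction the convergence results above are sensitive to.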
