Stochastic Variance Reduction for Nonconvex Optimization

We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.

[1]  J. Moreau Fonctions convexes duales et points proximaux dans un espace hilbertien , 1962 .

[2]  Boris Polyak Gradient methods for the minimisation of functionals , 1963 .

[3]  Harold J. Kushner,et al.  wchastic. approximation methods for constrained and unconstrained systems , 1978 .

[4]  John Darzentas,et al.  Problem Complexity and Method Efficiency in Optimization , 1983 .

[5]  L. Bottou Stochastic Gradient Learning in Neural Networks , 1991 .

[6]  Tamer Basar,et al.  Analysis of Recursive Stochastic Algorithms , 2001 .

[7]  Yurii Nesterov,et al.  Introductory Lectures on Convex Optimization - A Basic Course , 2014, Applied Optimization.

[8]  Yurii Nesterov,et al.  Cubic regularization of Newton method and its global performance , 2006, Math. Program..

[9]  H. Robbins A Stochastic Approximation Method , 1951 .

[10]  Alexander Shapiro,et al.  Stochastic Approximation approach to Stochastic Programming , 2013 .

[11]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[12]  Julien Mairal,et al.  Convex optimization with sparsity-inducing norms , 2011 .

[13]  Suvrit Sra,et al.  Scalable nonconvex inexact proximal splitting , 2012, NIPS.

[14]  Ohad Shamir,et al.  Optimal Distributed Online Prediction Using Mini-Batches , 2010, J. Mach. Learn. Res..

[15]  Shai Shalev-Shwartz,et al.  Stochastic dual coordinate ascent methods for regularized loss , 2012, J. Mach. Learn. Res..

[16]  Tong Zhang,et al.  Accelerating Stochastic Gradient Descent using Predictive Variance Reduction , 2013, NIPS.

[17]  Saeed Ghadimi,et al.  Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming , 2013, SIAM J. Optim..

[18]  Justin Domke,et al.  Finito: A faster, permutable incremental gradient method for big data problems , 2014, ICML.

[19]  Francis Bach,et al.  SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives , 2014, NIPS.

[20]  Alexander J. Smola,et al.  Efficient mini-batch training for stochastic optimization , 2014, KDD.

[21]  Lin Xiao,et al.  A Proximal Stochastic Gradient Method with Progressive Variance Reduction , 2014, SIAM J. Optim..

[22]  Stephen P. Boyd,et al.  Proximal Algorithms , 2013, Found. Trends Optim..

[23]  Ohad Shamir,et al.  A Stochastic PCA and SVD Algorithm with an Exponential Convergence Rate , 2014, ICML.

[24]  Léon Bottou,et al.  A Lower Bound for the Optimization of Finite Sums , 2014, ICML.

[25]  Shai Shalev-Shwartz,et al.  SDCA without Duality , 2015, ArXiv.

[26]  Shai Shalev-Shwartz,et al.  Beyond Convexity: Stochastic Quasi-Convex Optimization , 2015, NIPS.

[27]  Zeyuan Allen Zhu,et al.  UniVR: A Universal Variance Reduction Framework for Proximal Stochastic Gradient Method , 2015, ArXiv.

[28]  Furong Huang,et al.  Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition , 2015, COLT.

[29]  Yijun Huang,et al.  Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization , 2015, NIPS.

[30]  Dimitri P. Bertsekas,et al.  Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey , 2015, ArXiv.

[31]  Alexander J. Smola,et al.  On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants , 2015, NIPS.

[32]  Ohad Shamir,et al.  Fast Stochastic Algorithms for SVD and PCA: Convergence Properties and Convexity , 2015, ICML.

[33]  Jie Liu,et al.  Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting , 2014, IEEE Journal of Selected Topics in Signal Processing.

[34]  Alexander J. Smola,et al.  Fast Stochastic Methods for Nonsmooth Nonconvex Optimization , 2016, ArXiv.

[35]  Alexander J. Smola,et al.  Fast Incremental Method for Nonconvex Optimization , 2016, ArXiv.

[36]  Zeyuan Allen Zhu,et al.  Variance Reduction for Faster Non-Convex Optimization , 2016, ICML.

[37]  Peter Richtárik,et al.  Semi-Stochastic Gradient Descent Methods , 2013, Front. Appl. Math. Stat..

[38]  Mark W. Schmidt,et al.  Minimizing finite sums with the stochastic average gradient , 2013, Mathematical Programming.

[39]  Dimitris S. Papailiopoulos,et al.  Perturbed Iterate Analysis for Asynchronous Stochastic Optimization , 2015, SIAM J. Optim..

[40]  Yi Zhou,et al.  An optimal randomized incremental gradient method , 2015, Mathematical Programming.

[41]  Mingyi Hong,et al.  A Distributed, Asynchronous, and Incremental Algorithm for Nonconvex Optimization: An ADMM Approach , 2014, IEEE Transactions on Control of Network Systems.