Quantitative W1 Convergence of Langevin-Like Stochastic Processes with Non-Convex Potential and State-Dependent Noise

We prove quantitative rates at which discrete Langevin-like processes converge to the invariant distribution of a related stochastic differential equation. We study the setting in which the additive noise can be non-Gaussian and state-dependent, and the potential function can be non-convex. We show that the key properties of these processes depend on the potential function and on the second moment of the additive noise. We apply our theoretical findings to the convergence of Stochastic Gradient Descent (SGD) for non-convex problems and corroborate them with experiments using SGD to train deep neural networks on the CIFAR-10 dataset.
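
As a rough illustration of the objects involved (a sketch under generic assumptions; the step size $\eta$, potential $U$, noise term $\xi_k$, and covariance $\Sigma$ below are illustrative notation, and the paper's precise scaling and regularity conditions may differ), a Langevin-like discrete process with state-dependent noise can be written as

$$x_{k+1} = x_k - \eta \nabla U(x_k) + \eta\, \xi_k(x_k), \qquad \mathbb{E}[\xi_k(x_k) \mid x_k] = 0, \quad \operatorname{Cov}[\xi_k(x_k) \mid x_k] = \Sigma(x_k),$$

while the related stochastic differential equation, whose invariant distribution serves as the comparison target in the 1-Wasserstein ($W_1$) metric, takes the form

$$dX_t = -\nabla U(X_t)\, dt + \sqrt{\eta}\, \Sigma(X_t)^{1/2}\, dB_t.$$

SGD with minibatch gradients is an instance of the discrete recursion above, with $\xi_k$ the (state-dependent, generally non-Gaussian) difference between the stochastic gradient and the full gradient.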
