On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration

We undertake a precise study of the asymptotic and non-asymptotic properties of stochastic approximation procedures with Polyak-Ruppert averaging for solving a linear system $\bar{A} \theta = \bar{b}$. When the matrix $\bar{A}$ is Hurwitz, we prove a central limit theorem (CLT) for the averaged iterates with a fixed step size as the number of iterations goes to infinity. The CLT characterizes the exact asymptotic covariance matrix, which is the sum of the classical Polyak-Ruppert covariance and a correction term that scales with the step size. Under assumptions on the tail of the noise distribution, we prove a non-asymptotic concentration inequality whose main term matches the covariance in the CLT in any direction, up to universal constants. When the matrix $\bar{A}$ is not Hurwitz and its eigenvalues are only guaranteed to have non-negative real parts, we prove that the averaged linear stochastic approximation (LSA) procedure still achieves an $O(1/T)$ rate in mean-squared error. Our results provide a more refined understanding of linear stochastic approximation in both the asymptotic and non-asymptotic settings. We also present applications of the main results, including the study of momentum-based stochastic gradient methods and of temporal difference algorithms in reinforcement learning.
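As a rough illustration of the procedure studied here, the following sketch runs constant-step-size linear stochastic approximation and tracks the Polyak-Ruppert average of the iterates. Several details are assumptions made for illustration and are not taken from the abstract: the update form $\theta_{t+1} = \theta_t - \eta (A_t \theta_t - b_t)$, the sign convention that all eigenvalues of $\bar{A}$ have positive real part, the i.i.d. Gaussian noise model, and the names `lsa_polyak_ruppert`, `eta`, and `noise_std`.

```python
import random


def lsa_polyak_ruppert(A_bar, b_bar, eta, T, noise_std=0.1, seed=0):
    """Constant-step-size LSA with a running Polyak-Ruppert average.

    Hypothetical helper for illustration only. Returns a pair
    (last_iterate, averaged_iterate), each a list of floats.
    """
    rng = random.Random(seed)
    d = len(b_bar)
    theta = [0.0] * d  # current iterate theta_t, started at zero
    avg = [0.0] * d    # running mean of theta_1, ..., theta_t

    for t in range(1, T + 1):
        # Noisy observations with E[A_t] = A_bar and E[b_t] = b_bar
        # (assumed noise model: independent Gaussian entries).
        A_t = [[A_bar[i][j] + noise_std * rng.gauss(0.0, 1.0) for j in range(d)]
               for i in range(d)]
        b_t = [b_bar[i] + noise_std * rng.gauss(0.0, 1.0) for i in range(d)]

        # Stochastic residual A_t theta_t - b_t.
        resid = [sum(A_t[i][j] * theta[j] for j in range(d)) - b_t[i]
                 for i in range(d)]

        # Fixed-step update followed by the incremental average.
        theta = [theta[i] - eta * resid[i] for i in range(d)]
        avg = [avg[i] + (theta[i] - avg[i]) / t for i in range(d)]

    return theta, avg


if __name__ == "__main__":
    # Non-symmetric A_bar with eigenvalues 2 and 1; the solution is [1, 2].
    A_bar = [[2.0, 1.0], [0.0, 1.0]]
    b_bar = [4.0, 2.0]
    last, avg = lsa_polyak_ruppert(A_bar, b_bar, eta=0.05, T=20000)
    print("last iterate:    ", last)
    print("averaged iterate:", avg)
```

With a fixed step size the last iterate keeps fluctuating around the solution, while the averaged iterate is the quantity whose fluctuations the CLT and the concentration inequality describe. The sketch only illustrates the recursion and does not reproduce any of the paper's bounds.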
