Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector Problems

Motivated by the problem of online canonical correlation analysis, we propose the Stochastic Scaled-Gradient Descent (SSGD) algorithm for minimizing the expectation of a stochastic function over a generic Riemannian manifold. SSGD generalizes the idea of projected stochastic gradient descent and allows the use of scaled stochastic gradients instead of stochastic gradients. In the special case of a spherical constraint, which arises in generalized eigenvector problems, we establish a nonasymptotic finite-sample bound of 1/√T, and show that this rate is minimax optimal, up to a polylogarithmic factor in the relevant parameters. On the asymptotic side, a novel trajectory-averaging argument allows us to achieve local asymptotic normality with a rate that matches that of Ruppert-Polyak-Juditsky averaging. We bring these ideas together in an application to online canonical correlation analysis, deriving, for the first time in the literature, an optimal one-time-scale algorithm with an explicit rate of local asymptotic convergence to normality. Numerical studies of canonical correlation analysis on synthetic data are also provided.
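The spherically constrained special case can be sketched as follows: to find the top generalized eigenvector of a pencil (A, B), one takes a stochastic scaled-gradient step on the generalized Rayleigh quotient and projects back onto the unit sphere. The problem instance, noise model, and step-size schedule below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# Illustrative sketch: stochastic scaled-gradient descent on the sphere for the
# top generalized eigenvector of (A, B), i.e. maximizing w'Aw / w'Bw, ||w|| = 1.
rng = np.random.default_rng(0)
d = 5
A = np.diag([5.0, 1.0, 1.0, 1.0, 1.0])  # top generalized eigenvector is e_1
B = np.diag([1.0, 1.0, 1.0, 2.0, 2.0])

def noisy(M, scale=0.5):
    """Unbiased stochastic estimate of M with symmetric Gaussian noise."""
    N = scale * rng.standard_normal(M.shape)
    return M + (N + N.T) / 2.0

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for t in range(1, 5001):
    At, Bt = noisy(A), noisy(B)
    rho = (w @ At @ w) / (w @ Bt @ w)          # stochastic Rayleigh quotient
    grad = At @ w - rho * Bt @ w               # scaled stochastic gradient
    w = w + (0.05 / np.sqrt(t)) * grad         # step with a 1/sqrt(t) schedule
    w /= np.linalg.norm(w)                     # project back onto the sphere

alignment = abs(w[0])  # |cosine| between the iterate and the true eigenvector e_1
```

For the diagonal pencil above, the stationary direction of the projected dynamics is e_1, so `alignment` approaches one as the step size decays; the trajectory-averaged estimator studied in the paper would further average the iterates w along this path.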
