Instability, Computational Efficiency and Statistical Accuracy

Many statistical estimators are defined as the fixed point of a data-dependent operator, with estimators based on minimizing a cost function being an important special case. The limiting performance of such estimators depends on the properties of the population-level operator in the idealized limit of infinitely many samples. We develop a general framework that yields bounds on statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level, and its degree of (in)stability when applied to an empirical object based on $n$ samples. Using this framework, we analyze both stable forms of gradient descent and some higher-order and unstable algorithms, including Newton's method and its cubic-regularized variant, as well as the EM algorithm. We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, single-index models, and informative non-response models. We exhibit cases in which an unstable algorithm can achieve the same statistical accuracy as a stable algorithm in exponentially fewer steps---namely, with the number of iterations being reduced from polynomial to logarithmic in sample size $n$.
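
To make the fixed-point viewpoint concrete, the sketch below is a minimal illustration (not the paper's actual construction): it iterates the sample EM operator for a symmetric two-component Gaussian mixture $0.5\,N(\theta,1) + 0.5\,N(-\theta,1)$, one of the model classes named above, and stops once the iterates stabilize. The sample size, initialization, and stopping tolerance are arbitrary choices made for the example.

```python
import numpy as np

# Minimal sketch: an estimator defined as the fixed point of a
# data-dependent operator.  For the symmetric two-component mixture
# 0.5*N(theta,1) + 0.5*N(-theta,1), the sample EM update has the
# closed form  M_n(theta) = mean(x * tanh(theta * x)).
rng = np.random.default_rng(0)
n, theta_star = 10_000, 1.5                       # illustrative choices
signs = rng.choice([-1.0, 1.0], size=n)
x = theta_star * signs + rng.standard_normal(n)   # n draws from the mixture

def em_operator(theta: float) -> float:
    """Sample EM operator M_n; its fixed point is the EM estimate."""
    return float(np.mean(x * np.tanh(theta * x)))

theta, tol = 0.1, 1e-6                            # crude start and tolerance
for t in range(1, 1000):
    theta_next = em_operator(theta)
    if abs(theta_next - theta) < tol:             # iterates have stabilized
        break
    theta = theta_next

print(f"EM fixed point ~ {theta_next:.4f} after {t} iterations "
      f"(truth {theta_star})")
```

In this well-separated regime the sample operator is a contraction near the truth, so the fixed point is reached in a handful of iterations; the framework described above relates the accuracy of such a fixed point to the population operator's convergence rate together with the operator's stability under sampling.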
