Optimal prediction for sparse linear models? Lower bounds for coordinate-separable M-estimators

For the problem of high-dimensional sparse linear regression, it is known that an $\ell_0$-based estimator can achieve a $1/n$ "fast" rate for the prediction error without any conditions on the design matrix, whereas, in the absence of restrictive conditions on the design matrix, popular polynomial-time methods guarantee only the $1/\sqrt{n}$ "slow" rate. In this paper, we show that the slow rate is intrinsic to a broad class of M-estimators. In particular, for estimators based on minimizing a least-squares cost function together with a (possibly nonconvex) coordinate-wise separable regularizer, there is always a "bad" local optimum whose prediction error is lower bounded by a constant multiple of $1/\sqrt{n}$. For convex regularizers, this lower bound applies to all global optima. The theory covers many popular estimators, including convex $\ell_1$-based methods as well as M-estimators with nonconvex regularizers such as the SCAD penalty and the MCP. In addition, for a broad class of nonconvex regularizers, we show that bad local optima are very common: local minimization algorithms with random initialization typically converge to a bad solution.
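
As a concrete rendering of the estimator class in question (a minimal sketch in our own notation, with $y \in \mathbb{R}^n$ the response vector, $X \in \mathbb{R}^{n \times p}$ the design matrix, and $\rho_\lambda$ a univariate penalty, none of which are fixed by the abstract itself), a coordinate-wise separable M-estimator takes the form
\[
\widehat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \; \frac{1}{2n} \, \| y - X\theta \|_2^2 \; + \; \sum_{j=1}^{p} \rho_\lambda(\theta_j),
\]
where the choice $\rho_\lambda(t) = \lambda |t|$ recovers the Lasso, while the SCAD and MCP penalties correspond to particular bounded, nonconvex choices of $\rho_\lambda$. The lower bound described above asserts that, for any such separable $\rho_\lambda$, some local optimum (and every global optimum when $\rho_\lambda$ is convex) has prediction error of order at least $1/\sqrt{n}$.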
