Benign overfitting in linear regression

The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum-norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum-norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lie in an infinite-dimensional space than when the data lie in a finite-dimensional space whose dimension grows faster than the sample size.
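
To make the abstract's main objects concrete, the minimal NumPy sketch below (not from the paper; the function names, the toy spectrum, and the choice k = 10 are illustrative assumptions) computes the minimum-norm interpolating rule via the Moore-Penrose pseudoinverse and the two effective-rank quantities commonly written r_k(Σ) = (Σ_{i>k} λ_i)/λ_{k+1} and R_k(Σ) = (Σ_{i>k} λ_i)² / Σ_{i>k} λ_i² in this line of work.

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Minimum-norm solution of X @ theta = y (overparameterized case d > n).

    np.linalg.pinv returns the Moore-Penrose pseudoinverse, so among all
    interpolating solutions this theta has the smallest Euclidean norm.
    """
    return np.linalg.pinv(X) @ y

def effective_ranks(eigvals, k):
    """Two effective-rank notions for a covariance with eigenvalues eigvals:
    r_k = (sum_{i>k} lambda_i) / lambda_{k+1}
    R_k = (sum_{i>k} lambda_i)^2 / (sum_{i>k} lambda_i^2)
    """
    lam = np.sort(eigvals)[::-1]           # eigenvalues in descending order
    tail = lam[k:]                          # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return r_k, R_k

# Toy overparameterized regression: d >> n, noisy labels, perfect training fit.
rng = np.random.default_rng(0)
n, d = 50, 2000
eigvals = 1.0 / (1.0 + np.arange(d))        # slowly decaying spectrum (assumption)
X = rng.normal(size=(n, d)) * np.sqrt(eigvals)   # rows have covariance diag(eigvals)
theta_star = np.zeros(d)
theta_star[0] = 1.0
y = X @ theta_star + 0.1 * rng.normal(size=n)    # noisy targets

theta_hat = min_norm_interpolator(X, y)
print("training error:", np.linalg.norm(X @ theta_hat - y))      # ~0: perfect fit
print("effective ranks (r_k, R_k) at k=10:", effective_ranks(eigvals, 10))
```

In the overparameterized regime (d > n) the fit to the noisy training labels is essentially exact; this is the kind of interpolating rule whose prediction accuracy the characterization in terms of these effective ranks addresses.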
