The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks

The recent success of neural network models has shed light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of benign overfitting has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization and the properties of the data covariance matrix in achieving low excess risk.
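To make the setting concrete, below is a minimal numerical sketch of the kind of estimator the abstract describes: a two-layer linear network f(x) = v^T W x trained by gradient descent with a small step size (a discretization of gradient flow) on the squared loss, in an overparameterized regime where it interpolates noisy labels. The dimensions, the initialization scale `init_scale`, the data model, and the step count below are illustrative assumptions and are not taken from the paper.

```python
# Illustrative sketch only: two-layer linear network trained with small-step
# gradient descent (approximating gradient flow) on the squared loss.
# All constants and the data distribution are assumptions for demonstration.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 100, 50                     # n samples, input dim d > n, hidden width k
X = rng.normal(size=(n, d))               # sub-Gaussian covariates (isotropic Gaussian here)
theta_star = rng.normal(size=d) / np.sqrt(d)
y = X @ theta_star + 0.1 * rng.normal(size=n)   # noisy labels

init_scale = 1e-2                         # "quality" of the initialization (assumed scale)
W = init_scale * rng.normal(size=(k, d)) / np.sqrt(d)
v = init_scale * rng.normal(size=k) / np.sqrt(k)

lr = 1e-2
for step in range(100_000):
    pred = X @ W.T @ v                    # network output: f(x) = v^T W x
    resid = pred - y
    grad_v = W @ X.T @ resid / n          # gradient of (1/2n)||X W^T v - y||^2 w.r.t. v
    grad_W = np.outer(v, resid @ X) / n   # gradient of the same loss w.r.t. W
    v -= lr * grad_v
    W -= lr * grad_W

theta_hat = W.T @ v                       # effective linear predictor of the network
train_mse = np.mean((X @ theta_hat - y) ** 2)
excess_risk = np.sum((theta_hat - theta_star) ** 2)   # population excess risk for isotropic x
print(f"train MSE ~ {train_mse:.2e} (interpolation), excess risk ~ {excess_risk:.3f}")
```

Since the covariates here are isotropic, the population excess risk of the learned network reduces to the squared distance between its effective linear predictor W^T v and the ground-truth parameter; with anisotropic covariance one would instead weight this distance by the covariance matrix.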
