When Does Preconditioning Help or Hurt Generalization?

While second-order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization remains controversial. For instance, it has been pointed out that gradient descent (GD), in contrast to many preconditioned updates, converges to small-Euclidean-norm solutions in overparameterized models, leading to favorable generalization properties. This work presents a more nuanced view of how first- and second-order methods compare in terms of generalization. We provide an asymptotic bias-variance decomposition of the generalization error of overparameterized ridgeless regression under a general class of preconditioners $\boldsymbol{P}$, with the inverse population Fisher information matrix (used in NGD) as a particular example. We determine the optimal $\boldsymbol{P}$ for both the bias and the variance, and find that the relative generalization performance of different optimizers depends on the label noise and the "shape" of the signal (the true parameters): when the labels are noisy, the model is misspecified, or the signal is misaligned with the features, NGD can achieve lower risk; conversely, GD generalizes better than NGD under clean labels, a well-specified model, or an aligned signal. Based on this analysis, we discuss several approaches to managing the bias-variance tradeoff, as well as the potential benefit of interpolating between GD and NGD. We then extend our analysis to regression in reproducing kernel Hilbert spaces and demonstrate that preconditioned GD can decrease the population risk faster than GD. Lastly, we empirically compare the generalization performance of first- and second-order optimizers in neural network experiments, and observe robust trends matching our theoretical analysis.
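To make the setup concrete, the following minimal sketch (not the paper's code) computes the interpolants reached by preconditioned GD from zero initialization on overparameterized ridgeless regression, comparing plain GD ($\boldsymbol{P} = \boldsymbol{I}$) with an NGD-style choice of $\boldsymbol{P}$ equal to the inverse population covariance, which coincides with the inverse Fisher information up to the noise scale in a linear-Gaussian model. The covariance spectrum, signal shapes, noise level, and variable names are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: limits of preconditioned GD on overparameterized ridgeless regression.
# For a fixed positive definite P and a small enough step size, the iteration
# theta <- theta - eta * P * grad, started from zero, converges to the interpolant
#     theta_P = P X^T (X P X^T)^{-1} y,
# i.e., the minimum ||theta||_{P^{-1}}-norm solution of X theta = y.
# GD corresponds to P = I; the NGD-style update below uses P = Sigma^{-1}.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                                 # overparameterized: d > n
eigs = np.linspace(1.0, 0.01, d)               # anisotropic population spectrum (illustrative)
Sigma = np.diag(eigs)

def interpolant(P, X, y):
    """Limit of preconditioned GD from zero init: min ||theta||_{P^{-1}} s.t. X theta = y."""
    return P @ X.T @ np.linalg.solve(X @ P @ X.T, y)

def population_risk(theta, theta_star):
    """E_x[(x^T theta - x^T theta_star)^2] = (theta - theta_star)^T Sigma (theta - theta_star)."""
    delta = theta - theta_star
    return float(delta @ Sigma @ delta)

def experiment(theta_star, sigma_noise, trials=50):
    """Average population risk of the GD and NGD-style interpolants over random draws."""
    risks_gd, risks_ngd = [], []
    for _ in range(trials):
        X = rng.normal(size=(n, d)) * np.sqrt(eigs)    # rows ~ N(0, Sigma)
        y = X @ theta_star + sigma_noise * rng.normal(size=n)
        risks_gd.append(population_risk(interpolant(np.eye(d), X, y), theta_star))
        risks_ngd.append(population_risk(interpolant(np.diag(1.0 / eigs), X, y), theta_star))
    return np.mean(risks_gd), np.mean(risks_ngd)

# Aligned signal (weighted toward large-eigenvalue directions), clean labels.
aligned = rng.normal(size=d) * np.sqrt(eigs)
print("aligned, clean    (GD, NGD):", experiment(aligned / np.linalg.norm(aligned), 0.0))

# Misaligned signal (weighted toward small-eigenvalue directions), noisy labels.
misaligned = rng.normal(size=d) / np.sqrt(eigs)
print("misaligned, noisy (GD, NGD):", experiment(misaligned / np.linalg.norm(misaligned), 0.5))
```

Under the characterization above, the aligned, clean-label setting is the regime where GD is expected to generalize better, while the misaligned, noisy-label setting is where the NGD-style preconditioner can achieve lower risk; the closed-form limit makes the differing implicit biases of the two updates explicit.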
