Pathological Spectra of the Fisher Information Metric and Its Variants in Deep Neural Networks

The Fisher information matrix (FIM) plays an essential role in statistics and machine learning, both as a Riemannian metric tensor and as a component of the Hessian of loss functions. Focusing on the FIM and its variants in deep neural networks (DNNs), we reveal their characteristic dependence on the network width, depth, and sample size when the network has random weights and is sufficiently wide. This study covers two widely used FIMs: one for regression with linear output and one for classification with softmax output. Both FIMs asymptotically show pathological eigenvalue spectra in the sense that a small number of eigenvalues become large outliers, whose magnitude depends on the width or sample size, while the remaining eigenvalues are much smaller. This implies that the local shape of the parameter space, or of the loss landscape, is very sharp in a few specific directions and almost flat in all others. In particular, the softmax output disperses the outliers and causes a tail of the eigenvalue density to spread out from the bulk. We also show that pathological spectra appear in other variants of the FIM: one is the neural tangent kernel; another is a metric for the input and feature space that arises from feedforward signal propagation. Thus, we provide a unified perspective on the FIM and its variants that will lead to a more quantitative understanding of learning in large-scale DNNs.
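
The following minimal numerical sketch (not code from the paper) illustrates the kind of spectrum described above for a random one-hidden-layer tanh network with scalar linear output. The width, sample size, and 1/fan-in Gaussian weight scaling are illustrative assumptions; it also uses the standard duality between the parameter-space FIM and the sample-space Gram (NTK-type) matrix to keep the eigenvalue computation cheap.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, N = 50, 500, 100            # input dim, hidden width, sample size (illustrative)

# Random inputs and Gaussian weights with variance 1/fan_in.
X = rng.standard_normal((N, d))
W1 = rng.standard_normal((M, d)) / np.sqrt(d)
w2 = rng.standard_normal(M) / np.sqrt(M)

# One-hidden-layer network with scalar linear output: f(x) = w2 . tanh(W1 x).
H = np.tanh(X @ W1.T)                                       # hidden activations, N x M
J_w2 = H                                                    # df/dw2, N x M
J_W1 = ((1.0 - H**2) * w2)[:, :, None] * X[:, None, :]      # df/dW1, N x M x d
J = np.concatenate([J_w2, J_W1.reshape(N, M * d)], axis=1)  # full Jacobian, N x P

# For regression (MSE with unit noise variance) the empirical FIM is F = (1/N) J^T J.
# Its nonzero eigenvalues equal those of the N x N dual Gram matrix (1/N) J J^T,
# which is (up to the 1/N factor) the empirical NTK and is far cheaper to diagonalize.
K = J @ J.T / N
eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
print("largest eigenvalues :", eigs[:3])
print("median eigenvalue   :", np.median(eigs))
# A handful of outlier eigenvalues typically dominate, while the bulk is much smaller.
```

Increasing the width M or the sample size N in this sketch makes the gap between the few outliers and the bulk more pronounced, which is the pathological behavior the abstract refers to.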
