Are All Layers Created Equal?

Understanding deep neural networks has been a major research objective in recent years, with notable theoretical progress. A focal point of these studies is the success of excessively large networks, which defy the classical wisdom of uniform convergence and learnability. We empirically study the layer-wise functional structure of overparameterized deep models and provide evidence that layers are heterogeneous. To do so, we introduce the notion of robustness to post-training re-initialization and re-randomization, and show that layers can be categorized as either "ambient" or "critical". Resetting the ambient layers to their initial values has no negative consequence, and in many cases they barely change throughout training. In contrast, resetting the critical layers completely destroys the predictor, and performance drops to chance. Our study provides further evidence that mere parameter counting or norm accounting is too coarse for studying generalization of deep models, and that flatness or robustness analysis needs to respect the network architecture.
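The re-initialization and re-randomization probes can be summarized in a short sketch. Below is a minimal, hypothetical PyTorch illustration, not the paper's exact protocol: a toy MLP is trained on synthetic data, each layer is then reset either to the values it had at initialization ("re-initialization") or to a fresh random draw ("re-randomization"), and the accuracy after the reset indicates whether the layer behaves as ambient or critical. The toy model, data, and training loop are assumptions for illustration only.

```python
# Minimal sketch of post-training layer re-initialization / re-randomization.
# Toy MLP and synthetic data are illustrative assumptions, not the paper's setup.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic 3-class classification data (stand-in for a real dataset).
X, y = torch.randn(512, 20), torch.randint(0, 3, (512,))

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),
)

# Snapshot the parameters at initialization (checkpoint 0).
init_state = copy.deepcopy(model.state_dict())

# Ordinary training.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    nn.functional.cross_entropy(model(X), y).backward()
    opt.step()

trained_state = copy.deepcopy(model.state_dict())

def accuracy() -> float:
    with torch.no_grad():
        return (model(X).argmax(dim=1) == y).float().mean().item()

def reset_layer(layer_idx: int, mode: str) -> None:
    """Overwrite one layer's parameters after training.

    mode='reinit' -> restore the values the layer had at initialization.
    mode='rerand' -> draw fresh random values from the default initializer.
    """
    model.load_state_dict(trained_state)  # start from the trained network
    layer = model[layer_idx]
    if mode == "reinit":
        with torch.no_grad():
            for name, param in layer.named_parameters():
                param.copy_(init_state[f"{layer_idx}.{name}"])
    elif mode == "rerand":
        layer.reset_parameters()  # fresh random initialization

print(f"trained accuracy: {accuracy():.3f}")
for idx in (0, 2, 4):  # the three Linear layers in the toy model
    for mode in ("reinit", "rerand"):
        reset_layer(idx, mode)
        print(f"layer {idx} {mode}: accuracy {accuracy():.3f}")
model.load_state_dict(trained_state)  # restore the trained model afterwards
```

On the real architectures studied in the paper, ambient layers tolerate such resets with little loss in accuracy, whereas resetting a critical layer drops performance to chance.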
