AutoInit: Analytic Signal-Preserving Weight Initialization for Neural Networks

Neural networks require careful weight initialization to prevent signals from exploding or vanishing. Existing initialization schemes solve this problem in specific cases by assuming that the network has a certain activation function or topology. Such weight initialization strategies are difficult to derive, so modern architectures often reuse these same schemes even when their assumptions do not hold. This paper introduces AutoInit, a weight initialization algorithm that automatically adapts to different neural network architectures. By analytically tracking the mean and variance of signals as they propagate through the network, AutoInit scales the weights at each layer appropriately to avoid exploding or vanishing signals. Experiments demonstrate that AutoInit improves the performance of various convolutional and residual networks across a range of activation function, dropout, weight decay, learning rate, and normalizer settings. Further, in neural architecture search and activation function meta-learning, AutoInit automatically calculates specialized weight initialization strategies for thousands of unique architectures and hundreds of unique activation functions, and improves performance in vision, language, tabular, multi-task, and transfer learning scenarios. AutoInit thus serves as an automatic configuration tool that makes the design of new neural network architectures more robust. The AutoInit package provides a wrapper around existing TensorFlow models and is available at https://github.com/cognizant-ai-labs/autoinit.
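
To make the mean-and-variance tracking concrete, the sketch below rescales the kernels of a plain Dense-plus-ReLU Keras stack so that each pre-activation keeps roughly a unit second moment, propagating closed-form rectified-Gaussian moments forward layer by layer. It is a minimal illustration of the general idea under stated assumptions, not the released AutoInit algorithm: the helper names (relu_moments, scale_dense_relu_stack), the restriction to Dense layers with ReLU activations, and the unit-second-moment target are all choices made here for brevity.

import math

import numpy as np
import tensorflow as tf


def relu_moments(mean, var):
    # Mean and variance of relu(z) for z ~ N(mean, var), using the standard
    # closed-form moments of a rectified Gaussian.
    std = math.sqrt(var)
    t = mean / std
    cdf = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    new_mean = mean * cdf + std * pdf
    second_moment = (mean ** 2 + var) * cdf + mean * std * pdf
    return new_mean, max(second_moment - new_mean ** 2, 1e-12)


def scale_dense_relu_stack(model, in_mean=0.0, in_var=1.0, seed=0):
    # Walk a sequential stack of Dense(..., activation="relu") layers,
    # rescale each kernel so the pre-activation second moment stays near 1,
    # then propagate the post-ReLU moments to the next layer.
    rng = np.random.default_rng(seed)
    mean, var = in_mean, in_var
    for layer in model.layers:
        if not isinstance(layer, tf.keras.layers.Dense):
            continue
        weights = layer.get_weights()
        kernel = weights[0]
        fan_in = kernel.shape[0]
        # With zero-mean i.i.d. weights, the pre-activation second moment is
        # fan_in * sigma_w^2 * E[x^2]; choose sigma_w so that this equals 1.
        second_in = mean ** 2 + var
        sigma_w = math.sqrt(1.0 / (fan_in * second_in))
        new_weights = [rng.normal(0.0, sigma_w, size=kernel.shape)]
        if len(weights) > 1:                      # keep a zero bias if present
            new_weights.append(np.zeros_like(weights[1]))
        layer.set_weights(new_weights)
        mean, var = relu_moments(0.0, 1.0)        # moments after the ReLU
    return model


# Example usage (assumed shapes): a three-layer ReLU MLP on 32-dimensional inputs.
mlp = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                           tf.keras.layers.Dense(64, activation="relu"),
                           tf.keras.layers.Dense(10, activation="relu")])
mlp.build(input_shape=(None, 32))
scale_dense_relu_stack(mlp)

The released package applies this kind of moment propagation across general layer types and topologies through the TensorFlow model wrapper mentioned above, rather than the Dense-only walk shown here.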
