Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
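For concreteness, the per-mini-batch transform the abstract describes can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `batch_norm_forward` and the epsilon value are our own choices, and the running statistics the full method maintains for inference are omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """One Batch Normalization forward pass over a mini-batch.

    x:     (batch, features) layer inputs
    gamma: (features,) learned scale
    beta:  (features,) learned shift
    """
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize: zero mean, unit variance
    return gamma * x_hat + beta            # learned scale and shift

# Example: normalize a batch of 32 activations with 64 features.
x = np.random.randn(32, 64) * 5.0 + 3.0       # poorly scaled inputs
y = batch_norm_forward(x, np.ones(64), np.zeros(64))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])  # means near 0, stds near 1
```

The learned parameters `gamma` and `beta` let the transform represent the identity, so normalization stabilizes the distribution of each layer's inputs without reducing what the layer can express.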
