Understanding Very Deep Networks via Volume Conservation

Recently, very deep neural networks have set new records across many application domains, such as Residual Networks on the ImageNet challenge and Highway Networks on language processing tasks. We expect further excellent performance improvements in different fields from these very deep networks. However, these networks are still poorly understood, especially since they rely on non-standard architectures. In this contribution we analyze the learning dynamics required for successfully training very deep neural networks. For the analysis we use a symplectic network architecture which inherently conserves volume when mapping a representation from one layer to the next. It therefore avoids the vanishing gradient problem, which in turn allows thousands of layers to be trained effectively. We consider highway and residual networks as well as the LSTM model, all of which have approximately volume-conserving mappings. We identify two important factors for making deep architectures work: (1) (near) volume-conserving mappings of the form x + f(x) or similar (cf. avoiding the vanishing gradient); (2) controlling the drift effect, which increases or decreases x during propagation toward the output (cf. avoiding bias shifts).
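The following is a minimal numerical sketch (not the paper's code) of the two mapping types discussed above: a residual-style mapping x + f(x), whose Jacobian determinant stays close to 1 when the residual branch is small, and an additive coupling layer, whose Jacobian determinant is exactly 1 by construction. The layer width, weight scales, and function names are illustrative assumptions.

```python
# Sketch: checking (near) volume conservation via the Jacobian determinant.
# All sizes and weight scales below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 6  # toy layer width

def residual_layer(x, W):
    """Residual-style mapping x + f(x) with f(x) = tanh(W x)."""
    return x + np.tanh(W @ x)

def additive_coupling_layer(x, W):
    """Volume-conserving coupling: the second half of x is shifted by a
    function of the first half, so the Jacobian is triangular with unit
    diagonal and its determinant is exactly 1."""
    x1, x2 = x[: d // 2], x[d // 2 :]
    return np.concatenate([x1, x2 + np.tanh(W @ x1)])

def jacobian_det(layer, x, eps=1e-6):
    """Central finite-difference estimate of det(d layer(x) / d x)."""
    J = np.empty((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        J[:, i] = (layer(x + e) - layer(x - e)) / (2 * eps)
    return np.linalg.det(J)

x = rng.standard_normal(d)
W_res = 0.1 * rng.standard_normal((d, d))        # small residual branch
W_cpl = rng.standard_normal((d // 2, d // 2))

print(jacobian_det(lambda v: residual_layer(v, W_res), x))           # near 1
print(jacobian_det(lambda v: additive_coupling_layer(v, W_cpl), x))  # 1 (up to FD error)
```

A Jacobian determinant near 1 means the layer neither collapses nor blows up the volume of the representation, which is the property linked above to avoiding vanishing gradients across many stacked layers.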
