NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations

This paper introduces the Non-Autonomous Input-Output Stable Network (NAIS-Net), a very deep architecture in which each stacked processing block is derived from a time-invariant non-autonomous dynamical system. Non-autonomy is implemented through skip connections from the block input to each of the unrolled processing stages, which allows stability to be enforced so that blocks can be unrolled adaptively to a pattern-dependent processing depth. We prove that the network is globally asymptotically stable: with tanh units, every initial condition converges to exactly one input-dependent equilibrium, while ReLU units admit multiple stable equilibria. An efficient implementation that enforces stability under the derived conditions is presented for both fully-connected and convolutional layers. Experimental results show that NAIS-Net exhibits stability in practice, yielding a significant reduction in the generalization gap compared to ResNets.
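To make the block structure concrete, below is a minimal, hypothetical PyTorch sketch of a fully-connected NAIS-Net-style block: the block input u is re-injected through a skip connection at every unrolled stage (the non-autonomous term), and a simple spectral-norm rescaling of the state matrix stands in as a placeholder for the paper's derived stability conditions. The class name, unroll length, step size, and the projection heuristic are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class NAISBlock(nn.Module):
    """Illustrative NAIS-Net-style block: unrolls
    x_{k+1} = x_k + h * tanh(A x_k + B u + b),
    re-injecting the block input u at every stage (non-autonomy)."""

    def __init__(self, dim, n_unroll=10, h=0.1):
        super().__init__()
        self.A = nn.Linear(dim, dim, bias=False)  # state-transition matrix A
        self.B = nn.Linear(dim, dim, bias=True)   # input skip connection B u + b
        self.n_unroll = n_unroll
        self.h = h

    def forward(self, u):
        x = torch.zeros_like(u)  # initial state of the unrolled dynamics
        for _ in range(self.n_unroll):
            x = x + self.h * torch.tanh(self.A(x) + self.B(u))
        return x

    @torch.no_grad()
    def project_stable(self, max_norm=0.99):
        # Placeholder heuristic only: rescale A so its spectral norm stays
        # below 1. The paper instead derives its own sufficient conditions
        # for fully-connected and convolutional layers.
        s = torch.linalg.matrix_norm(self.A.weight, ord=2)
        if s > max_norm:
            self.A.weight.mul_(max_norm / s)

# Usage sketch: apply the projection after each optimizer step.
block = NAISBlock(dim=64)
y = block(torch.randn(8, 64))
block.project_stable()

In this toy setting, calling project_stable() after every update keeps the state matrix contractive so the unrolled iteration cannot diverge; the paper enforces stability through the derived conditions mentioned above rather than this heuristic.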
