Transport Analysis of Infinitely Deep Neural Network

We investigate the feature maps inside deep neural networks (DNNs) by tracking the transport map. We are interested in two questions: the role of depth (why do DNNs perform better than shallow models?) and the interpretation of DNNs (what do the intermediate layers do?). Despite the rapid progress of their applications, DNNs remain analytically unexplained because the hidden layers are nested and the parameters are not faithful. Inspired by the integral representation of shallow NNs, which is the continuum limit of the width, or the number of hidden units, we develop the flow representation and transport analysis of DNNs. The flow representation is the continuum limit of the depth, or the number of hidden layers, and it is specified by an ordinary differential equation (ODE) with a vector field. We interpret an ordinary DNN as a transport map, or an Euler broken-line approximation, of the flow. Technically speaking, a dynamical system is a natural model for nested feature maps; in addition, it opens a way toward a coordinate-free treatment of DNNs by avoiding their redundant parametrization. Following Wasserstein geometry, we analyze a flow from three aspects: as a dynamical system, a continuity equation, and a Wasserstein gradient flow. A key result is that we specify a series of transport maps of the denoising autoencoder (DAE), a cornerstone of the development of deep learning. Starting from the shallow DAE, this paper develops three topics: the transport map of the deep DAE, the equivalence between the stacked DAE and the composition of DAEs, and the double continuum limit, i.e., the integral representation of the flow representation. As partial answers to the research questions, we find that deeper DAEs converge faster and extract better features, and that a deep Gaussian DAE transports mass so as to decrease the Shannon entropy of the data distribution. We expect that further investigation of these questions will lead to interpretable and principled alternatives to DNNs.
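To make the entropy-decreasing claim concrete, the sketch below (not the paper's code; a minimal NumPy illustration) iterates the well-known Gaussian DAE reconstruction map g(x) = x + sigma^2 * grad log p_sigma(x), where p_sigma is the sigma-smoothed data density, on one-dimensional Gaussian data for which the smoothed score has a closed form. Composing several such layers is the Euler broken-line picture of the flow described above. The function name dae_transport and all constants are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's implementation): iterating the
# Gaussian DAE transport map g(x) = x + sigma^2 * grad log p_sigma(x) on 1-D
# Gaussian data. For data ~ N(0, s^2) corrupted with N(0, sigma^2) noise, the
# smoothed score is grad log p_sigma(x) = -x / (s^2 + sigma^2), so each layer
# is a linear shrinkage. The differential entropy 0.5 * log(2*pi*e*var)
# decreases at every layer, illustrating the claim that a deep Gaussian DAE
# transports mass so as to decrease the entropy of the data distribution.
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.1                        # DAE corruption variance sigma^2
x = rng.normal(0.0, 1.0, 10_000)    # samples from the initial data distribution N(0, 1)

def dae_transport(x, sigma2):
    """One Gaussian DAE layer: x + sigma^2 times the score of the smoothed density."""
    s2 = x.var()                    # current (Gaussian) data variance
    score = -x / (s2 + sigma2)      # analytic smoothed score for N(0, s2)
    return x + sigma2 * score

for layer in range(5):
    entropy = 0.5 * np.log(2 * np.pi * np.e * x.var())  # differential entropy of a Gaussian
    print(f"layer {layer}: var = {x.var():.4f}, entropy = {entropy:.4f}")
    x = dae_transport(x, sigma2)
```

Running the loop prints a monotonically shrinking variance and entropy; taking the layer width of the composition to zero while increasing the number of layers recovers the continuous-time flow discussed in the abstract.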
