A Bayesian Perspective of Convolutional Neural Networks through a Deconvolutional Generative Model

Inspired by the success of Convolutional Neural Networks (CNNs) for supervised prediction on images, we design the Deconvolutional Generative Model (DGM), a new probabilistic generative model whose inference computations correspond to those in a given CNN architecture. The DGM uses a CNN to design the prior distribution in the probabilistic model. Furthermore, the DGM generates images from coarse to fine scales. It introduces a small set of latent variables at each scale and enforces dependencies among all the latent variables via a conjugate prior distribution. This conjugate prior yields a new regularizer for training CNNs, based on the paths rendered in the generative model: the Rendering Path Normalization (RPN). We demonstrate that this regularizer improves generalization, both in theory and in practice. In addition, likelihood estimation in the DGM yields training losses for CNNs; inspired by this, we design a new loss termed the Max-Min cross entropy, which outperforms the traditional cross-entropy loss for object classification. The Max-Min cross entropy suggests a new deep network architecture, namely the Max-Min network, which can learn from less labeled data while maintaining good prediction performance. Our experiments demonstrate that the DGM with the RPN and the Max-Min architecture exceeds or matches the state of the art on benchmarks including SVHN, CIFAR10, and CIFAR100 for semi-supervised and supervised learning tasks.
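
The abstract does not spell out the form of the Max-Min cross entropy, so the following is only an illustrative sketch under an explicit assumption: a small CNN with two weight-sharing branches, a "max" branch using the usual ReLU and a hypothetical "min" branch using a complementary activation, each scored with standard cross entropy and averaged. The class MaxMinToyNet, the activation choice in the min branch, and the 0.5/0.5 weighting are assumptions for illustration, not the paper's definition.

```python
# Illustrative sketch only: the abstract above does not define the Max-Min
# cross entropy. This toy combines standard cross entropy over two
# weight-sharing branches ("max" = ReLU, "min" = a hypothetical complementary
# activation). All names and the 0.5/0.5 weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaxMinToyNet(nn.Module):
    """Tiny CNN whose shared weights are evaluated under two activations."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.head = nn.Linear(16, num_classes)

    def _branch(self, x: torch.Tensor, use_max: bool) -> torch.Tensor:
        h = self.conv(x)
        # "max" branch: usual ReLU; "min" branch: keep the negative part
        # (an assumption standing in for whatever the paper's min branch does).
        h = F.relu(h) if use_max else -F.relu(-h)
        h = F.adaptive_avg_pool2d(h, 1).flatten(1)
        return self.head(h)

    def forward(self, x: torch.Tensor):
        return self._branch(x, True), self._branch(x, False)


def max_min_cross_entropy(logits_max, logits_min, targets):
    """Average of the two branch cross entropies (illustrative weighting)."""
    return 0.5 * (F.cross_entropy(logits_max, targets)
                  + F.cross_entropy(logits_min, targets))


if __name__ == "__main__":
    net = MaxMinToyNet()
    x = torch.randn(8, 3, 32, 32)          # e.g. a CIFAR10-sized batch
    y = torch.randint(0, 10, (8,))
    loss = max_min_cross_entropy(*net(x), y)
    loss.backward()
    print(float(loss))
```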
