Learning the Stein Discrepancy for Training and Evaluating Energy-Based Models without Sampling

We present a new method for evaluating and training unnormalized density models. Our approach requires access only to the gradient of the unnormalized model's log-density. We estimate the Stein discrepancy between the data density $p(x)$ and the model density $q(x)$, a discrepancy defined via a vector-valued critic function of the data. We parameterize this critic with a neural network and fit its parameters to maximize the discrepancy. This yields a novel goodness-of-fit test which outperforms existing methods on high-dimensional data. Furthermore, optimizing $q(x)$ to minimize this discrepancy produces a novel method for training unnormalized models which scales more gracefully than existing approaches. The ability to both learn and compare models is a unique feature of the proposed method.
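Concretely, the quantity being optimized is the Stein discrepancy with a learned critic $f$, $\mathbb{E}_{x \sim p}\left[f(x)^\top \nabla_x \log q(x) + \nabla_x \cdot f(x)\right]$, which by Stein's identity is zero for every suitably regular $f$ when $p = q$. The sketch below is a minimal illustration of this objective, not the authors' implementation: the names `log_q` and `critic`, the regularization weight, and the toy Gaussian data are all hypothetical stand-ins, and the divergence term is computed exactly rather than with the stochastic trace estimator a high-dimensional setting would call for.

```python
import torch
import torch.nn as nn

def stein_discrepancy(x, log_q, critic):
    """Monte Carlo estimate of E_{x~p}[ f(x)^T grad_x log q(x) + div_x f(x) ].

    By Stein's identity this is zero for every (suitably regular) f when
    p = q, so its maximum over f measures model fit. The divergence is
    exact here; high dimensions would use a stochastic trace estimator.
    """
    x = x.clone().requires_grad_(True)
    # Score of the unnormalized model: grad_x log q(x).
    score = torch.autograd.grad(log_q(x).sum(), x, create_graph=True)[0]
    f = critic(x)  # vector-valued critic f(x), same shape as x
    # Exact divergence: sum_i d f_i(x) / d x_i.
    div = sum(
        torch.autograd.grad(f[:, i].sum(), x, create_graph=True)[0][:, i]
        for i in range(x.shape[1])
    )
    return (f * score).sum(dim=1).mean() + div.mean()

# Toy setup: a standard-Gaussian "model" tested against shifted-Gaussian data.
dim = 2
log_q = lambda x: -0.5 * (x ** 2).sum(dim=1)  # unnormalized log-density
critic = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
data = torch.randn(256, dim) + 1.0

for _ in range(200):
    sd = stein_discrepancy(data, log_q, critic)
    reg = critic(data).pow(2).sum(dim=1).mean()  # L2 penalty keeps the critic bounded
    loss = -(sd - 0.1 * reg)  # ascend the regularized discrepancy
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The maximized discrepancy then serves directly as a goodness-of-fit statistic; for training, one would instead alternate this critic update with a gradient step that minimizes the same quantity with respect to the parameters of $q(x)$.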
