The Cramer Distance as a Solution to Biased Wasserstein Gradients

The Wasserstein probability metric has received much attention from the machine learning community. Unlike the Kullback-Leibler divergence, which strictly measures change in probability, the Wasserstein metric reflects the underlying geometry between outcomes. The value of being sensitive to this geometry has been demonstrated in, among other settings, ordinal regression and generative modelling, and most recently in reinforcement learning. In this paper we describe three natural properties of probability divergences that we believe reflect requirements from machine learning: sum invariance, scale sensitivity, and unbiased sample gradients. The Wasserstein metric possesses the first two properties but, unlike the Kullback-Leibler divergence, does not possess the third. We provide empirical evidence suggesting this is a serious issue in practice. Leveraging insights from probabilistic forecasting we propose an alternative to the Wasserstein metric, the Cramer distance. We show that the Cramer distance possesses all three desired properties, combining the best of the Wasserstein and Kullback-Leibler divergences. We give empirical results on a number of domains comparing these three divergences. To illustrate the practical relevance of the Cramer distance we design a new algorithm, the Cramer Generative Adversarial Network (GAN), and show that it has a number of desirable properties over the related Wasserstein GAN.
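To make the distinction between the two metrics concrete, the following minimal sketch (not taken from the paper; the function names and the grid-based quadrature are illustrative choices) estimates both the 1-Wasserstein and the Cramer distance between two sets of samples through their empirical CDFs. The 1-Wasserstein distance integrates the absolute difference of the CDFs, while the Cramer distance integrates its square.

    import numpy as np

    def empirical_cdf(samples, grid):
        # Fraction of samples at or below each grid point.
        return np.mean(samples[None, :] <= grid[:, None], axis=1)

    def wasserstein_1(p, q, grid):
        # W_1(P, Q) = integral of |F_P(x) - F_Q(x)| dx, approximated on a uniform grid.
        dx = grid[1] - grid[0]
        return np.sum(np.abs(empirical_cdf(p, grid) - empirical_cdf(q, grid))) * dx

    def cramer_distance(p, q, grid):
        # l_2(P, Q) = square root of the integral of (F_P(x) - F_Q(x))^2 dx.
        dx = grid[1] - grid[0]
        return np.sqrt(np.sum((empirical_cdf(p, grid) - empirical_cdf(q, grid)) ** 2) * dx)

    rng = np.random.default_rng(0)
    p_samples = rng.normal(0.0, 1.0, size=10000)
    q_samples = rng.normal(0.5, 1.0, size=10000)
    grid = np.linspace(-6.0, 6.0, 2001)
    print(wasserstein_1(p_samples, q_samples, grid))   # roughly 0.5 for this pure location shift
    print(cramer_distance(p_samples, q_samples, grid))

Per the abstract, it is this squared-CDF form that gives the Cramer distance unbiased sample gradients while remaining, like the Wasserstein metric, sensitive to the geometry of the outcome space.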
