Learning to generate images with perceptual similarity metrics

Deep networks are increasingly being applied to problems involving image synthesis, e.g., generating images from textual descriptions and reconstructing an input image from a compact representation. Supervised training of image-synthesis networks typically uses a pixel-wise loss (PL) to indicate the mismatch between a generated image and its corresponding target image. We propose instead to use a loss function that is better calibrated to human perceptual judgments of image quality: the multiscale structural-similarity score (MS-SSIM) [1]. Because MS-SSIM is differentiable, it is easily incorporated into gradient-descent learning. We compare the consequences of using MS-SSIM versus PL loss on training autoencoders. Human observers reliably prefer images synthesized by MS-SSIM-optimized models over those synthesized by PL-optimized models, for two distinct PL measures (L1 and L2 distances). We also explore the effect of training objective on image encoding and analyze conditions under which perceptually-optimized representations yield better performance on image classification. Finally, we demonstrate the superiority of perceptually-optimized networks for super-resolution imaging. We argue that significant advances can be made in modeling images through the use of training objectives that are well aligned to characteristics of human perception.

[1]  Paul Smolensky,et al.  Information processing in dynamical systems: foundations of harmony theory , 1986 .

[2]  Geoffrey E. Hinton,et al.  Learning and relearning in Boltzmann machines , 1986 .

[3]  Scott J. Daly,et al.  Visible differences predictor: an algorithm for the assessment of image fidelity , 1992, Electronic Imaging.

[4]  Michael I. Jordan,et al.  Forward Models: Supervised Learning with a Distal Teacher , 1992, Cogn. Sci..

[5]  Geoffrey E. Hinton,et al.  The Helmholtz Machine , 1995, Neural Computation.

[6]  Olivier Verscheure,et al.  Perceptual quality measure using a spatiotemporal model of the human visual system , 1996, Electronic Imaging.

[7]  Jan P. Allebach,et al.  Methodology for designing image similarity metrics based on human visual system models , 1997, Electronic Imaging.

[8]  Jeffrey Lubin A human vision system model for objective image fidelity and target detectability measurements , 1998, 9th European Signal Processing Conference (EUSIPCO 1998).

[9]  Stefan Winkler,et al.  A perceptual distortion metric for digital color images , 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[10]  Jitendra Malik,et al.  A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[11]  Zhou Wang,et al.  Multi-scale structural similarity for image quality assessment , 2003 .

[12]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[13]  David J. Kriegman,et al.  Acquiring linear subspaces for face recognition under variable lighting , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[15]  Alan C. Bovik,et al.  Image information and visual quality , 2006, IEEE Trans. Image Process..

[16]  Sheila S. Hemami,et al.  VSNR: A Wavelet-Based Visual Signal-to-Noise Ratio for Natural Images , 2007, IEEE Transactions on Image Processing.

[17]  Eero P. Simoncelli,et al.  Maximum differentiation (MAD) competition: a methodology for comparing computational models of perceptual quantities. , 2008, Journal of vision.

[18]  Antonio Torralba,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence 1 80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition , 2022 .

[19]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Michael Elad,et al.  On Single Image Scale-Up Using Sparse-Representations , 2010, Curves and Surfaces.

[21]  Jürgen Schmidhuber,et al.  Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction , 2011, ICANN.

[22]  Geoffrey E. Hinton,et al.  Using very deep autoencoders for content-based image retrieval , 2011, ESANN.

[23]  Honglak Lee,et al.  An Analysis of Single-Layer Networks in Unsupervised Feature Learning , 2011, AISTATS.

[24]  Orly Yadid-Pecht,et al.  Quaternion Structural Similarity: A New Quality Index for Color Images , 2012, IEEE Transactions on Image Processing.

[25]  Mohammed Hassan,et al.  Structural Similarity Measure for Color Images , 2012 .

[26]  Aline Roumy,et al.  Low-Complexity Single-Image Super-Resolution based on Nonnegative Neighbor Embedding , 2012, BMVC.

[27]  Chuohao Yeo,et al.  On Rate Distortion Optimization Using SSIM , 2013, IEEE Trans. Circuits Syst. Video Technol..

[28]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[29]  Jan Kautz,et al.  Is L2 a Good Loss Function for Neural Networks for Image Processing , 2015 .

[30]  Rob Fergus,et al.  Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks , 2015, NIPS.

[31]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[32]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[33]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[34]  Richard S. Zemel,et al.  Generative Moment Matching Networks , 2015, ICML.

[35]  Arthur Gretton,et al.  A Test of Relative Similarity For Model Selection in Generative Models , 2015, ICLR.

[36]  Matthias Bethge,et al.  A note on the evaluation of generative models , 2015, ICLR.

[37]  Xiaoou Tang,et al.  Image Super-Resolution Using Deep Convolutional Networks , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[39]  Xinyun Chen Under Review as a Conference Paper at Iclr 2017 Delving into Transferable Adversarial Ex- Amples and Black-box Attacks , 2016 .

[40]  Weisi Lin,et al.  Maximum a Posterior and Perceptually Motivated Reconstruction Algorithm: A Generic Framework , 2017, IEEE Transactions on Multimedia.