Lightweight dynamic conditional GAN with pyramid attention for text-to-image synthesis

Abstract The text-to-image synthesis task aims to generate photographic images conditioned on semantic text descriptions. To ensure the sharpness and fidelity of generated images, this task tends to generate high-resolution images (e.g., 1282 or 2562). However, as the resolution increases, the network parameters and complexity increases dramatically. Recent works introduce network structures with extensive parameters and heavy computations to guarantee the production of high-resolution images. As a result, these models come across problems of the unstable training process and high training cost. To tackle these issues, in this paper, we propose an effective information compensation based approach, namely Lightweight Dynamic Conditional GAN (LD-CGAN). LD-CGAN is a compact and structured single-stream network, and it consists of one generator and two independent discriminators to regularize and generate 642 and 1282 images in one feed-forward process. Specifically, the generator of LD-CGAN is composed of three major components: (1) Conditional Embedding (CE), which is an automatically unsupervised learning process aiming at disentangling integrated semantic attributes in the text space; (2) Conditional Manipulating Modular (CM-M) in Conditional Manipulating Block (CM-B), which is designed to continuously provide the image features with the compensation information (i.e., the disentangled attribute); and (3) Pyramid Attention Refine Block (PAR-B), which is used to enrich multi-scale features by capturing spatial importance between multi-scale context. Consequently, experiments conducted under two benchmark datasets, CUB and Oxford-102, indicate that our generated 1282 images can achieve comparable performance with 2562 images generated by the state-of-the-arts on two evaluation metrics: Inception Score (IS) and Visual-semantic Similarity (VS). Compared with the current state-of-the-art HDGAN, our LD-CGAN significantly decreases the number of parameters and computation time by 86.8% and 94.9%, respectively.

[1]  Jie Li,et al.  Unsupervised Semantic-Preserving Adversarial Hashing for Image Search , 2019, IEEE Transactions on Image Processing.

[2]  Raymond Y. K. Lau,et al.  Least Squares Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[3]  Xuelong Li,et al.  From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[4]  Timo Aila,et al.  A Style-Based Generator Architecture for Generative Adversarial Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Jaakko Lehtinen,et al.  Progressive Growing of GANs for Improved Quality, Stability, and Variation , 2017, ICLR.

[6]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[7]  Yang Yang,et al.  Multitask Spectral Clustering by Exploring Intertask Correlation , 2015, IEEE Transactions on Cybernetics.

[8]  Ya Ju Fan,et al.  Autoencoder Node Saliency: Selecting Relevant Latent Representations , 2017, Pattern Recognit..

[9]  Xianglong Liu,et al.  Spatio-temporal deformable 3D ConvNets with attention for action recognition , 2020, Pattern Recognit..

[10]  Li Fei-Fei,et al.  Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[11]  Zhe Gan,et al.  AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Shichao Zhang,et al.  Robust Joint Graph Sparse Coding for Unsupervised Spectral Feature Selection , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[13]  Heng Tao Shen,et al.  Perceptual Pyramid Adversarial Networks for Text-to-Image Synthesis , 2019, AAAI.

[14]  Dimitris N. Metaxas,et al.  StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Andrew Zisserman,et al.  Automated Flower Classification over a Large Number of Classes , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[16]  Kai Fan,et al.  Zero-Shot Learning via Class-Conditioned Deep Generative Models , 2017, AAAI.

[17]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[18]  Dacheng Tao,et al.  Two-Stream Deep Hashing With Class-Specific Centers for Supervised Image Search , 2020, IEEE Transactions on Neural Networks and Learning Systems.

[19]  Lin Yang,et al.  Photographic Text-to-Image Synthesis with a Hierarchically-Nested Adversarial Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Cheng Shi,et al.  Adaptive multi-scale deep neural networks with perceptual loss for panchromatic and multispectral images classification , 2019, Inf. Sci..

[21]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[22]  Nicu Sebe,et al.  Binary Neural Networks: A Survey , 2020, Pattern Recognit..

[23]  Heng Tao Shen,et al.  Unified Binary Generative Adversarial Network for Image Retrieval and Compression , 2020, International Journal of Computer Vision.

[24]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[25]  Bernt Schiele,et al.  Learning What and Where to Draw , 2016, NIPS.

[26]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[27]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[28]  Bernt Schiele,et al.  Learning Deep Representations of Fine-Grained Visual Descriptions , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[30]  Yang Yang,et al.  Adversarial Cross-Modal Retrieval , 2017, ACM Multimedia.

[31]  Xiaogang Wang,et al.  StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.