CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP

Contrastive learning with the InfoNCE objective is exceptionally successful in various self-supervised learning tasks. Recently, the CLIP model yielded impressive results on zero-shot transfer learning when using InfoNCE for learning visual representations from natural language supervision. However, InfoNCE, as a lower bound on the mutual information, has been shown to perform poorly for high mutual information. In contrast, the InfoLOOB upper bound (leave one out bound) works well for high mutual information but suffers from large variance and instabilities. We introduce “Contrastive Leave One Out Boost” (CLOOB), where modern Hopfield networks boost learning with the InfoLOOB objective. Modern Hopfield networks replace the original embeddings with retrieved embeddings in the InfoLOOB objective. The retrieved embeddings give InfoLOOB two assets. Firstly, the retrieved embeddings stabilize InfoLOOB, since they are less noisy and more similar to one another than the original embeddings. Secondly, they are enriched by correlations, since the covariance structure of the embeddings is reinforced through retrievals. We compare CLOOB to CLIP after training on the Conceptual Captions and YFCC datasets with respect to their zero-shot transfer learning performance on other datasets. CLOOB consistently outperforms CLIP at zero-shot transfer learning across all considered architectures and datasets.
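
To make the mechanism concrete, below is a minimal PyTorch sketch of the objective as described in this abstract: a one-step modern Hopfield retrieval over the current batch, followed by InfoLOOB computed on the retrieved embeddings. This is an illustrative sketch, not the authors' released implementation; the function names and the default values of beta (retrieval inverse temperature) and inv_tau (similarity scaling) are assumptions made for the example.

```python
# Minimal sketch (assumption: NOT the authors' released code) of the CLOOB
# objective described above. hopfield_retrieve, info_loob, cloob_loss and the
# defaults for beta / inv_tau are illustrative names and choices.
import torch
import torch.nn.functional as F


def hopfield_retrieve(memory: torch.Tensor, queries: torch.Tensor, beta: float) -> torch.Tensor:
    """One retrieval step of a modern (continuous) Hopfield network:
    softmax(beta * Q M^T) M, re-normalized to the unit sphere."""
    weights = F.softmax(beta * queries @ memory.t(), dim=-1)  # [N, N] association weights
    return F.normalize(weights @ memory, dim=-1)              # averaged, denoised patterns


def info_loob(anchors: torch.Tensor, positives: torch.Tensor, inv_tau: float) -> torch.Tensor:
    """InfoLOOB: an InfoNCE-style loss, but the positive pair is left out of
    the denominator, so only the N-1 negatives appear there."""
    sims = inv_tau * anchors @ positives.t()                  # [N, N] scaled similarities
    pos = sims.diag()                                         # positive-pair terms
    mask = torch.eye(sims.size(0), dtype=torch.bool)
    neg = sims.masked_fill(mask, float("-inf"))               # drop positives from denominator
    return (torch.logsumexp(neg, dim=-1) - pos).mean()


def cloob_loss(x: torch.Tensor, y: torch.Tensor, beta: float = 8.0, inv_tau: float = 30.0) -> torch.Tensor:
    """CLOOB loss for a batch of paired image embeddings x and text embeddings y:
    both modalities query both Hopfield memories (the current batch), and the
    retrieved embeddings replace the originals inside InfoLOOB."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    u_x = hopfield_retrieve(x, x, beta)  # image memory queried by images
    u_y = hopfield_retrieve(x, y, beta)  # image memory queried by texts
    v_x = hopfield_retrieve(y, x, beta)  # text memory queried by images
    v_y = hopfield_retrieve(y, y, beta)  # text memory queried by texts
    return info_loob(u_x, u_y, inv_tau) + info_loob(v_y, v_x, inv_tau)


if __name__ == "__main__":
    # Toy usage: 8 random image/text embedding pairs of dimension 64.
    x, y = torch.randn(8, 64), torch.randn(8, 64)
    print(cloob_loss(x, y).item())
```

Note that InfoLOOB differs from InfoNCE only in the denominator, where the positive pair is excluded; this leave-one-out form avoids the saturation of InfoNCE at high mutual information, while the Hopfield retrieval step supplies the denoised, covariance-enriched embeddings that keep it stable.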
