What Do We Maximize in Self-Supervised Learning?

In this paper, we examine self-supervised learning methods, particularly VICReg, to provide an information-theoretic understanding of their construction. As a first step, we demonstrate how information-theoretic quantities can be obtained for a deterministic network, offering a possible alternative to prior work that relies on stochastic models. This enables us to demonstrate how VICReg can be (re)discovered from first principles, together with the assumptions it makes about the data distribution. Furthermore, we empirically validate these assumptions, confirming our new understanding of VICReg. Finally, we believe that the derivation and insights we obtain can be generalized to many other SSL methods, opening new avenues for theoretical and practical understanding of SSL and transfer learning.
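For orientation, the sketch below gives a minimal NumPy implementation of the standard VICReg objective (invariance, variance, and covariance terms) that the paper analyzes; the coefficient values, helper names, and exact normalization are illustrative assumptions rather than the paper's own formulation.

```python
# Minimal sketch of the standard VICReg objective (invariance + variance + covariance),
# for illustration only. Coefficients lam, mu, nu and the defaults below are assumptions.
import numpy as np

def vicreg_loss(z_a, z_b, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    """z_a, z_b: (batch, dim) embeddings of two augmented views of the same inputs."""
    n, d = z_a.shape

    # Invariance: mean squared difference between the two views' embeddings.
    inv = np.mean((z_a - z_b) ** 2)

    # Variance: hinge keeping the std of every embedding dimension above gamma.
    def variance_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))
    var = variance_term(z_a) + variance_term(z_b)

    # Covariance: penalize off-diagonal entries of each view's covariance matrix,
    # discouraging redundancy between embedding dimensions.
    def covariance_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d
    cov = covariance_term(z_a) + covariance_term(z_b)

    return lam * inv + mu * var + nu * cov

# Usage example with random embeddings, just to show the call signature.
rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=(256, 128)), rng.normal(size=(256, 128))
print(vicreg_loss(z_a, z_b))
```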
