Barlow Twins: Self-Supervised Learning via Redundancy Reduction

Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn representations that are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant representations. Most current methods avoid such collapsed solutions through careful implementation details. We propose an objective function that naturally avoids such collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the representation vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. Barlow Twins does not require large batches nor asymmetry between the network twins, such as a predictor network, gradient stopping, or a moving average on the weight updates. It allows the use of very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with the current state of the art for ImageNet classification with a linear classifier head, as well as for transfer tasks of classification and object detection.
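The abstract describes the objective in words: standardize each embedding dimension over the batch, form the cross-correlation matrix between the two distorted views, and push that matrix toward the identity. The sketch below is a minimal, illustrative rendering of that idea in PyTorch-style Python under stated assumptions; the function name `barlow_twins_loss`, the epsilon for numerical stability, and the off-diagonal weight `lambda_offdiag` are placeholders of my own, not values taken from this text.

```python
import torch

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-5):
    """Sketch of a redundancy-reduction objective as described in the abstract.

    z_a, z_b: (batch_size, dim) embeddings of two distorted views of the same
    batch of samples, produced by two identical networks.
    """
    n, d = z_a.shape

    # Standardize each embedding dimension across the batch
    # (zero mean, unit variance), so that c below is a correlation matrix.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)

    # Cross-correlation matrix between the two views, shape (d, d).
    c = (z_a.T @ z_b) / n

    # Invariance term: diagonal entries should equal 1
    # (the two views agree on each component).
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()

    # Redundancy-reduction term: off-diagonal entries should be 0
    # (different components carry non-redundant information).
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()

    return on_diag + lambda_offdiag * off_diag
```

In practice the two inputs would come from a shared encoder and projector applied to two augmented views of the same images; because the loss only constrains the cross-correlation matrix, no predictor network, gradient stopping, or momentum encoder is needed to prevent collapse.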
