Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Priors

Deep learning is increasingly moving towards a transfer learning paradigm whereby large foundation models are fine-tuned on downstream tasks, starting from an initialization learned on the source task. But an initialization contains relatively little information about the source task. Instead, we show that we can learn highly informative posteriors from the source task, through supervised or self-supervised approaches, which then serve as the basis for priors that modify the whole loss surface on the downstream task. This simple, modular approach enables significant performance gains and more data-efficient learning on a variety of downstream classification and segmentation tasks, serving as a drop-in replacement for standard pre-training strategies. These highly informative priors can also be saved for future use, similar to pre-trained weights, and stand in contrast to the zero-mean isotropic uninformative priors that are typically used in Bayesian deep learning.
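To make the mechanism concrete, here is a minimal sketch of how a posterior learned on the source task could re-shape the downstream objective rather than merely initialize it. It assumes a simplified diagonal-Gaussian posterior fit to weight snapshots collected during source-task training; the abstract does not specify this construction, and all function names below are illustrative, not from an official implementation.

```python
import torch
import torch.nn as nn

def fit_diag_gaussian(weight_snapshots):
    """Fit a diagonal Gaussian to flattened weight vectors collected along
    the source-task training trajectory (one 1-D tensor per checkpoint).
    This stands in for a richer learned posterior, e.g. one with a
    low-rank-plus-diagonal covariance."""
    stacked = torch.stack(weight_snapshots)   # (num_snapshots, num_params)
    mean = stacked.mean(dim=0)
    var = stacked.var(dim=0) + 1e-6           # jitter keeps the prior proper
    return mean, var

def flat_params(model):
    """Flatten all model parameters into a single vector (differentiable)."""
    return torch.cat([p.reshape(-1) for p in model.parameters()])

def log_prior(model, mean, var):
    """Log-density (up to an additive constant) of the learned Gaussian
    prior, evaluated at the current downstream weights."""
    diff = flat_params(model) - mean
    return -0.5 * torch.sum(diff * diff / var)

def map_loss(model, inputs, targets, mean, var, n_train,
             criterion=nn.CrossEntropyLoss()):
    """Downstream MAP objective: task NLL minus the (per-example scaled)
    log prior, so the source posterior modifies the entire loss surface
    instead of only supplying an initialization."""
    nll = criterion(model(inputs), targets)
    return nll - log_prior(model, mean, var) / n_train
```

In this sketch, `mean` and `var` can be saved to disk once and reused across downstream tasks, much like pre-trained weights; setting `var` to a large constant recovers the usual weight-decay-toward-initialization behavior, which is what the informative prior improves upon.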
