Distillation from heterogeneous unlabeled collections

Compressing deep networks is essential to expand their range of applications to constrained settings. The need for compression, however, often arises long after the model was trained, when the original data may no longer be available. On the other hand, unlabeled data, not necessarily related to the target task, is usually plentiful, especially for image classification. In this work, we propose a scheme to leverage such samples to distill the knowledge learned by a large teacher network into a smaller student. The proposed technique relies on (i) preferentially sampling datapoints that appear related to the target task, and (ii) taking better advantage of the learning signal. We show that the former speeds up the student's convergence, while the latter boosts its performance, achieving performance close to what can be expected with the original data.
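As an illustration only, the sketch below pairs standard soft-target distillation (Hinton et al., 2015) with a simple confidence-based selection of unlabeled images. The entropy-based "relatedness" score, the top-k selection, and all function names are assumptions made for the example, not the paper's actual sampling criterion or loss.

```python
# Minimal sketch, NOT the paper's exact method: distill a teacher into a student
# using an unrelated, unlabeled image pool. The relatedness score used here
# (negative predictive entropy of the teacher) is an assumed proxy for
# "appearing related"; the paper's criterion may differ.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset


def relatedness_scores(teacher, pool_loader, device="cpu"):
    """Score each unlabeled image by the teacher's confidence (low entropy)."""
    teacher.eval()
    scores = []
    with torch.no_grad():
        for x, *_ in pool_loader:
            probs = F.softmax(teacher(x.to(device)), dim=1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
            scores.append(-entropy.cpu())  # higher score = more "related"
    return torch.cat(scores)


def select_top_k(pool_dataset, scores, k):
    """Keep the k unlabeled images the teacher is most confident about."""
    top = torch.topk(scores, k).indices.tolist()
    return Subset(pool_dataset, top)


def distill(student, teacher, loader, epochs=10, T=4.0, lr=1e-3, device="cpu"):
    """Soft-target distillation on the selected unlabeled images;
    no ground-truth labels are used."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    teacher.eval()
    student.train()
    for _ in range(epochs):
        for x, *_ in loader:
            x = x.to(device)
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            # Temperature-scaled KL divergence between teacher and student.
            loss = F.kl_div(
                F.log_softmax(s_logits / T, dim=1),
                F.softmax(t_logits / T, dim=1),
                reduction="batchmean",
            ) * (T * T)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

A typical use would score the whole unlabeled pool once with `relatedness_scores`, build a `DataLoader` over the subset returned by `select_top_k`, and then call `distill`; concentrating training on the selected subset is what the abstract refers to as speeding up the student's convergence.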
