Characterizing Structural Regularities of Labeled Data in Overparameterized Models

Human learners appreciate that observations usually form hierarchies of regularities and sub-regularities. For example, English verbs include irregular cases that must be memorized (e.g., go -> went) and regular cases that generalize well (e.g., kiss -> kissed, miss -> missed). Likewise, deep neural networks have the capacity to memorize rare or irregular forms yet nonetheless generalize across instances that share common patterns or structures. We analyze how a model treats individual instances via a consistency score (C-score): the expected accuracy of a particular architecture on a held-out instance, given a training set of a given size sampled from the data distribution. We obtain empirical estimates of this score for individual instances in multiple data sets and show that it identifies out-of-distribution and mislabeled examples at one end of the continuum and highly regular examples at the other. We also explore two categories of proxies for the consistency score: proxies based on pairwise distances and proxies based on training statistics. We conclude with two applications of C-scores, understanding the dynamics of representation learning and filtering out outliers, and discuss other potential applications such as curriculum learning and active data collection.
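A minimal sketch of the empirical estimator implied by this definition: for each instance, average the held-out accuracy of models trained on random subsets of a fixed size that exclude that instance. The paper studies deep networks on image data; here a scikit-learn classifier on synthetic data stands in purely for illustration, and the names `estimate_c_scores`, `n_runs`, and `subset_ratio` are our own assumptions, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification


def estimate_c_scores(X, y, n_runs=20, subset_ratio=0.7, seed=0):
    """Estimate a per-instance consistency score by repeated subsampling."""
    rng = np.random.default_rng(seed)
    n = len(y)
    hits = np.zeros(n)    # times instance i was predicted correctly while held out
    counts = np.zeros(n)  # times instance i was actually held out

    for _ in range(n_runs):
        # Sample a training subset of the requested size; the rest is held out.
        train_idx = rng.choice(n, size=int(subset_ratio * n), replace=False)
        heldout = np.setdiff1d(np.arange(n), train_idx)

        model = LogisticRegression(max_iter=1000)  # stand-in for the architecture of interest
        model.fit(X[train_idx], y[train_idx])

        preds = model.predict(X[heldout])
        hits[heldout] += (preds == y[heldout])
        counts[heldout] += 1

    # Expected held-out accuracy per instance; instances never held out get NaN.
    return np.divide(hits, counts, out=np.full(n, np.nan), where=counts > 0)


if __name__ == "__main__":
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    c_scores = estimate_c_scores(X, y)
    # Low C-scores flag irregular or potentially mislabeled instances;
    # high C-scores mark regular instances that generalize well.
    print("lowest-scoring instances:", np.argsort(c_scores)[:10])
```

In this sketch the score for an instance is the fraction of runs in which a model trained without it classified it correctly, which matches the abstract's notion of expected held-out accuracy at a fixed training-set size.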
