On Monotonic Linear Interpolation of Neural Network Parameters

Linearly interpolating between initial neural network parameters and converged parameters after training with SGD typically leads to a monotonic decrease in the training objective. This Monotonic Linear Interpolation (MLI) property, first observed by Goodfellow et al. [17], persists despite the non-convex objectives and highly non-linear training dynamics of neural networks. Extending this work, we show that the property holds across varying network architectures, optimizers, and learning problems. We evaluate several possible hypotheses for this property that, to our knowledge, have not yet been explored. Additionally, we show that networks violating the MLI property can be produced systematically by forcing the weights to move far from initialization. The MLI property raises important questions about the loss landscape geometry of neural networks and highlights the need to further study its global properties.
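
Concretely, for an initialization theta_0 and a converged solution theta_T, the property concerns the training loss evaluated along the line theta_alpha = (1 - alpha) * theta_0 + alpha * theta_T for alpha in [0, 1]. The sketch below is a minimal PyTorch illustration of tracing this curve; the `model`, `loader`, and mean-reduced `loss_fn` objects are assumed for illustration and are not taken from the paper.

```python
import copy
import torch


def interpolation_losses(model, theta_0, theta_T, loss_fn, loader, num_alphas=25):
    """Training loss along the line (1 - alpha) * theta_0 + alpha * theta_T.

    theta_0 / theta_T are state_dicts of the initial and converged parameters.
    Returns (alpha, mean training loss) pairs; the MLI property holds when the
    losses decrease monotonically in alpha.
    """
    probe = copy.deepcopy(model)  # scratch copy so the trained model is untouched
    losses = []
    for alpha in torch.linspace(0.0, 1.0, num_alphas):
        # Interpolate floating-point parameters/buffers; copy anything else
        # (e.g. integer BatchNorm step counters) from the final checkpoint.
        interp = {}
        for name, w0 in theta_0.items():
            wT = theta_T[name]
            interp[name] = (1 - alpha) * w0 + alpha * wT if torch.is_floating_point(w0) else wT

        probe.load_state_dict(interp)
        probe.eval()

        # Average the training loss at this point on the line.
        total, count = 0.0, 0
        with torch.no_grad():
            for x, y in loader:
                total += loss_fn(probe(x), y).item() * x.size(0)
                count += x.size(0)
        losses.append((alpha.item(), total / count))
    return losses
```

In this sketch, theta_0 and theta_T would be captured with copy.deepcopy(model.state_dict()) at initialization and after training; checking the MLI property then amounts to checking that the returned losses are non-increasing in alpha.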

[1] Gintare Karolina Dziugaite, et al. Stabilizing the Lottery Ticket Hypothesis, 2019.

[2] Roland Vollgraf, et al. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, 2017, ArXiv.

[3] Guodong Zhang, et al. Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model, 2019, NeurIPS.

[4] Gilad Yehudai, et al. Proving the Lottery Ticket Hypothesis: Pruning is All You Need, 2020, ICML.

[5] Sanjeev Arora, et al. Theoretical Analysis of Auto Rate-Tuning by Batch Normalization, 2018, ICLR.

[6] Quynh Nguyen, et al. On Connected Sublevel Sets in Deep Learning, 2019, ICML.

[7] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[8] Gintare Karolina Dziugaite, et al. Linear Mode Connectivity and the Lottery Ticket Hypothesis, 2019, ICML.

[9] Michael Carbin, et al. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, 2018, ICLR.

[10] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[11] Shun-ichi Amari, et al. When Does Preconditioning Help or Hurt Generalization?, 2021, ICLR.

[12] Stanislav Fort, et al. The Goldilocks zone: Towards better understanding of neural network loss landscapes, 2018, AAAI.

[13] Geoffrey E. Hinton, et al. Lookahead Optimizer: k steps forward, 1 step back, 2019, NeurIPS.

[14] Marco Mondelli, et al. Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks, 2020, ICML.

[15] Jason Yosinski, et al. Measuring the Intrinsic Dimension of Objective Landscapes, 2018, ICLR.

[16] Andrew Gordon Wilson, et al. Averaging Weights Leads to Wider Optima and Better Generalization, 2018, UAI.

[17] Oriol Vinyals, et al. Qualitatively characterizing neural network optimization problems, 2014, ICLR.

[18] Jaehoon Lee, et al. Wide neural networks of any depth evolve as linear models under gradient descent, 2019, NeurIPS.

[19] Andrew Gordon Wilson, et al. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs, 2018, NeurIPS.

[20] Sanjeev Arora, et al. Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets, 2019, NeurIPS.

[21] Razvan Pascanu, et al. Sharp Minima Can Generalize For Deep Nets, 2017, ICML.

[22] Jascha Sohl-Dickstein, et al. The large learning rate phase of deep learning: the catapult mechanism, 2020, ArXiv.

[23] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[24] Surya Ganguli, et al. Emergent properties of the local geometry of neural loss landscapes, 2019, ArXiv.

[25] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[26] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, CVPR.

[27] Ethan Dyer, et al. Gradient Descent Happens in a Tiny Subspace, 2018, ArXiv.

[28] Boris Polyak, et al. Acceleration of stochastic approximation by averaging, 1992.

[29] D. Ruppert, et al. Efficient Estimations from a Slowly Convergent Robbins-Monro Process, 1988.

[30] Tom Schaul, et al. No more pesky learning rates, 2012, ICML.

[31] Renjie Liao, et al. Understanding Short-Horizon Bias in Stochastic Meta-Optimization, 2018, ICLR.

[32] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[33] Fred A. Hamprecht, et al. Essentially No Barriers in Neural Network Energy Landscape, 2018, ICML.

[34] Arthur Jacot, et al. Neural tangent kernel: convergence and generalization in neural networks, 2018, NeurIPS.