LocoProp: Enhancing BackProp via Local Loss Optimization

We study a local loss construction approach for optimizing neural networks. We start by motivating the problem as minimizing a squared loss between the pre-activations of each layer and a local target, plus a regularizer on the weights. The targets are chosen so that the first gradient descent step on the local objectives recovers vanilla BackProp, while the exact solution to each problem results in a preconditioned gradient update. We improve the local loss construction by forming a Bregman divergence in each layer, tailored to the transfer function, which keeps the local problem convex with respect to the weights. The generalized local problem is again solved iteratively by taking small gradient descent steps on the weights, for which the first step recovers BackProp. We run several ablations and show that our construction consistently improves convergence, reducing the gap between first-order and second-order methods.

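To make the construction concrete, below is a minimal NumPy sketch of the squared-loss local update for a single dense layer (bias omitted). The function and hyperparameter names (`local_update`, `gamma`, `eta`, `n_inner`) are illustrative assumptions rather than the paper's code; the sketch only demonstrates the key property stated above, namely that the first gradient descent step on the local objective reproduces the BackProp update, while further inner steps refine it.

```python
# Minimal sketch of a LocoProp-style local update for one dense layer,
# using the squared-loss local objective described in the abstract.
# All names and hyperparameter values here are illustrative assumptions.

import numpy as np

def local_update(W, a_prev, grad_z, gamma=10.0, eta=1e-4, n_inner=10):
    """One local-loss layer update (squared-loss variant, sketch).

    W       : (out_dim, in_dim) current layer weights
    a_prev  : (batch, in_dim)   layer inputs (activations of the previous layer)
    grad_z  : (batch, out_dim)  gradient of the global loss w.r.t. this layer's
                                pre-activations, obtained from a standard backward pass
    """
    z = a_prev @ W.T                    # pre-activations of this layer
    target = z - gamma * grad_z         # local target: a small step in pre-activation space

    W_new = W.copy()
    for _ in range(n_inner):
        # Gradient of the local objective
        #   (1 / (2*gamma)) * ||a_prev @ W_new.T - target||^2,
        # with proximity to the original W enforced implicitly by taking
        # only a few small gradient steps starting from W.
        residual = a_prev @ W_new.T - target                 # (batch, out_dim)
        grad_W = (residual.T @ a_prev) / (gamma * len(a_prev))
        W_new -= eta * grad_W
    return W_new
```

On the first inner iteration the residual equals `gamma * grad_z`, so `grad_W` reduces to the ordinary BackProp gradient of the global loss with respect to `W`, and the first step is exactly an SGD/BackProp step with learning rate `eta`; the remaining iterations approach the exact minimizer of the local problem, which corresponds to the preconditioned update. The Bregman-divergence generalization would replace the squared local loss with a divergence matched to the layer's transfer function, keeping the local problem convex in the weights.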