Generalization in deep network classifiers trained with the square loss

Tomaso Poggio and Qianli Liao

Abstract

Square loss has been observed to perform well in classification tasks. However, a theoretical justification is so far lacking, unlike the cross-entropy [1] case, for which an asymptotic analysis has been proposed (see [2] and [3] and references therein). Here we discuss several observations on the dynamics of gradient flow under the square loss in ReLU networks. We show that convergence to a solution with the absolute minimum norm is expected when normalization techniques such as Batch Normalization [4] (BN) or Weight Normalization [5] (WN) are used together with Weight Decay (WD). In the absence of BN+WD, good solutions for classification may still be achieved because of the implicit bias towards small-norm solutions that close-to-zero initial conditions introduce in the gradient descent dynamics. The main property of the minimizers that bounds their expected error is their norm: we prove that among all the close-to-interpolating solutions, the ones associated with smaller Frobenius norms of the unnormalized weight matrices have better margin and better bounds on the expected classification error. The theory yields several predictions, including the role of BN and weight decay, aspects of Papyan, Han and Donoho's Neural Collapse, and the constraints induced by BN on the network weights.

This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
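The connection between small Frobenius norms and large margin stated in the abstract can be sketched with a one-line calculation. The LaTeX fragment below is our illustrative reconstruction, not an excerpt from the paper's proofs; it assumes a bias-free (hence positively homogeneous) deep ReLU network of depth $L$, binary labels $y_n \in \{\pm 1\}$, and the weight factorization $W_k = \rho_k V_k$ with $\|V_k\|_F = 1$.

% Sketch (assumed notation): positive homogeneity in each layer's weights gives
\[
  f_W(x) \;=\; \Big(\prod_{k=1}^{L} \rho_k\Big)\, f_V(x) \;=:\; \rho\, f_V(x).
\]
% If the unnormalized network (close to) interpolates the labels, y_n f_W(x_n) >= 1 for all n,
% then the margin of the normalized network f_V satisfies
\[
  \min_n \; y_n f_V(x_n) \;\ge\; \frac{1}{\rho}.
\]
% Hence, among (close-to-)interpolating solutions, a smaller product of Frobenius norms
% yields a larger normalized margin.

Since classical margin-based generalization bounds for the norm-one class of normalized networks improve monotonically with the margin, this is the sense in which a smaller norm of the unnormalized weights implies a better bound on the expected classification error.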

References

[1] Colin Wei et al. Shape Matters: Understanding the Implicit Bias of the Noise Covariance, 2020, COLT.

[2] Matus Telgarsky et al. Spectrally-normalized margin bounds for neural networks, 2017, NIPS.

[3] Amit Daniely et al. The Implicit Bias of Depth: How Incremental Learning Drives Generalization, 2020, ICLR.

[4] Tomaso Poggio et al. Loss landscape: SGD has a better view, 2020.

[5] Nathan Srebro et al. Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models, 2019, ICML.

[6] Quynh Nguyen et al. On Connected Sublevel Sets in Deep Learning, 2019, ICML.

[7] Tim Salimans et al. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, 2016, NIPS.

[8] Mikhail Belkin et al. Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks, 2020, ICLR.

[9] Daniel Kunin et al. Loss Landscapes of Regularized Linear Autoencoders, 2019, ICML.

[10] Tomaso A. Poggio et al. Fisher-Rao Metric, Geometry, and Complexity of Neural Networks, 2017, AISTATS.

[11] Dacheng Tao et al. Orthogonal Deep Neural Networks, 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12] Gábor Lugosi et al. Introduction to Statistical Learning Theory, 2004, Advanced Lectures on Machine Learning.

[13] R. Douglas et al. Neuronal circuits of the neocortex, 2004, Annual Review of Neuroscience.

[14] David L. Donoho et al. Prevalence of neural collapse during the terminal phase of deep learning training, 2020, Proceedings of the National Academy of Sciences.

[15] Paulo Jorge S. G. Ferreira et al. The existence and uniqueness of the minimum norm solution to certain linear and nonlinear problems, 1996, Signal Processing.

[16] Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[17] Nathan Srebro et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, Journal of Machine Learning Research.

[18] Kaifeng Lyu et al. Gradient Descent Maximizes the Margin of Homogeneous Neural Networks, 2019, ICLR.

[19] Tomaso Poggio et al. Loss landscape: SGD can have a better view than GD, 2020.

[20] Sanjeev Arora et al. Theoretical Analysis of Auto Rate-Tuning by Batch Normalization, 2018, ICLR.

[21] Mikhail Belkin et al. Classification vs regression in overparameterized regimes: Does the loss function matter?, 2020, Journal of Machine Learning Research.

[22] Tomaso Poggio et al. Complexity control by gradient descent in deep networks, 2020, Nature Communications.

[23] Qianli Liao et al. Theoretical issues in deep networks, 2020, Proceedings of the National Academy of Sciences.