Theory of Deep Learning III: the non-overfitting puzzle

A main puzzle of deep networks is the apparent absence of overfitting, understood as robustness of the expected error against overparametrization, despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamics associated with gradient descent minimization of nonlinear networks is topologically equivalent, near the asymptotically stable minima of the empirical error, to a gradient system in a quadratic potential with a degenerate Hessian (for the square loss) or an almost degenerate Hessian (for the logistic or cross-entropy loss). The proposition rests on the qualitative theory of dynamical systems and is supported by numerical results. The result extends to deep nonlinear networks two key properties of gradient descent for linear networks, which have recently been recognized (1) to provide a form of implicit regularization:

1. For classification, which is the main application of today's deep networks, gradient descent converges asymptotically to the maximum-margin solution when minimizing loss functions such as the logistic, cross-entropy, and exponential losses (see the first sketch after this list). The maximum-margin solution guarantees good classification error for "low noise" datasets, and this property holds independently of the initial conditions. Because of it, our proposition guarantees a maximum-margin solution also for deep nonlinear networks.

2. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, and for appropriate initial conditions it converges asymptotically to the minimum-norm solution (see the second sketch below). This implies that there is usually an optimal early stopping point that avoids overfitting of the expected risk. This property, valid for the square loss and many other loss functions, is especially relevant for regression. For deep nonlinear networks, however, the solution is not expected to be strictly minimum norm, unlike in the linear case.

The robustness to overparametrization has suggestive implications for the robustness of deep convolutional network architectures with respect to the curse of dimensionality.
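
As an illustration of property 1, here is a minimal sketch (not from the paper; the toy data, learning rate, and iteration counts are arbitrary choices): plain gradient descent on the logistic loss for a linear classifier over separable data. The weight norm diverges, but the normalized direction w/||w|| stabilizes and the normalized margin increases, which is the sense in which the iterates approach the maximum-margin separator, independently of the (zero) initialization.

```python
# Minimal sketch (illustrative only): implicit bias of gradient descent on the
# logistic loss for a linear classifier over linearly separable toy data.
# ||w|| grows without bound, but w/||w|| stabilizes and the normalized margin
# increases, i.e. the iterates approach the maximum-margin separator.
import numpy as np

rng = np.random.default_rng(0)

# Two separable Gaussian blobs with labels in {-1, +1}.
X = np.vstack([rng.normal([+2, +2], 0.5, size=(50, 2)),
               rng.normal([-2, -2], 0.5, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

w = np.zeros(2)
lr = 0.1
for t in range(1, 200001):
    margins = y * (X @ w)
    # Gradient of the average logistic loss  log(1 + exp(-y_i w.x_i)).
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= lr * grad
    if t % 50000 == 0:
        norm = np.linalg.norm(w)
        print(f"iter {t:6d}  ||w|| = {norm:7.2f}  direction = {w / norm}  "
              f"normalized margin = {margins.min() / norm:.4f}")
```

Running longer keeps increasing ||w|| (roughly logarithmically in the number of iterations) while the printed direction barely changes: the classifier, not the loss value, is what converges.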

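Property 2 can be seen in the simplest overparametrized setting, linear least squares with more parameters than samples. The sketch below (again only illustrative; dimensions and step size are arbitrary) runs gradient descent from zero initialization: the iterates stay in the row space of the data matrix and converge to the minimum-norm interpolant given by the pseudoinverse, while the norm of an early-stopped iterate is much smaller, so the number of iterations acts as an implicit regularization parameter. The Hessian of this quadratic loss, X^T X, has rank at most n < d, the simplest instance of the degenerate Hessian mentioned above.

```python
# Minimal sketch (illustrative only): gradient descent on the square loss in an
# overparameterized linear model, started from w = 0, converges to the
# minimum-norm interpolating solution (the pseudoinverse solution); the number
# of iterations controls ||w|| and thus acts as an implicit regularizer.
import numpy as np

rng = np.random.default_rng(1)

n, d = 20, 100                              # fewer samples than parameters
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_pinv = np.linalg.pinv(X) @ y              # minimum-norm interpolant

w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, 2) ** 2        # step 1/sigma_max(X)^2, stable range
for t in range(1, 2001):
    grad = X.T @ (X @ w - y)                # gradient of 0.5 * ||Xw - y||^2
    w -= lr * grad
    if t in (1, 10, 100, 2000):
        print(f"iter {t:5d}  train MSE = {np.mean((X @ w - y) ** 2):.2e}  "
              f"||w|| = {np.linalg.norm(w):.3f}  "
              f"||w - w_pinv|| = {np.linalg.norm(w - w_pinv):.2e}")

# The Hessian X^T X is degenerate: its rank is at most n < d.
print("rank of Hessian X^T X:", np.linalg.matrix_rank(X.T @ X), "of", d)
```

Stopping at t = 10 or t = 100 gives an estimate with much smaller norm than the interpolant, which is the sense in which the iteration count plays the role of a regularization parameter and an optimal early stopping point can exist.
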
[1] J. Czipszer, et al. Sur l'approximation d'une fonction périodique et de ses dérivées successives par un polynome trigonométrique et par ses dérivées successives, 1958.

[2] Kurt Hornik, et al. Neural networks and principal component analysis: Learning from examples without local minima, 1989, Neural Networks.

[3] Hrushikesh Narhar Mhaskar, et al. Approximation properties of a multilayered feedforward artificial neural network, 1993, Adv. Comput. Math.

[4] Peter L. Bartlett, et al. Neural Network Learning: Theoretical Foundations, 1999.

[5] B. Aulbach, et al. The Hartman-Grobman theorem for Carathéodory-type differential equations in Banach spaces, 2000.

[6] Hee-ra Kim. The Structure of Hope in Waiting for Godot, 2003.

[7] Y. Yao, et al. On Early Stopping in Gradient Descent Learning, 2007.

[8] Shie Mannor, et al. Robustness and Regularization of Support Vector Machines, 2008, J. Mach. Learn. Res.

[9] Surya Ganguli, et al. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013, ICLR.

[10] Lorenzo Rosasco, et al. Learning with Incremental Iterative Regularization, 2014, NIPS.

[11] Kenji Kawaguchi, et al. Deep Learning without Poor Local Minima, 2016, NIPS.

[12] Yoram Singer, et al. Train faster, generalize better: Stability of stochastic gradient descent, 2015, ICML.

[13] Yann LeCun, et al. Singularity of the Hessian in Deep Learning, 2016, ArXiv.

[14] T. Poggio, et al. Deep vs. shallow networks: An approximation theory perspective, 2016, ArXiv.

[15] Lorenzo Rosasco, et al. Optimal Rates for Multi-pass Stochastic Gradient Methods, 2016, J. Mach. Learn. Res.

[16] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[17] Razvan Pascanu, et al. Sharp Minima Can Generalize For Deep Nets, 2017, ICML.

[18] Tomaso A. Poggio, et al. Theory II: Landscape of the Empirical Risk in Deep Learning, 2017, ArXiv.

[19] Tengyu Ma, et al. Identity Matters in Deep Learning, 2016, ICLR.

[20] Noah Golowich, et al. Musings on Deep Learning: Properties of SGD, 2017.

[21] Lorenzo Rosasco, et al. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review, 2016, International Journal of Automation and Computing.

[22] Nathan Srebro, et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res.

[23] Tomaso A. Poggio, et al. Theory of Deep Learning IIb: Optimization Properties of SGD, 2018, ArXiv.

[24] Tomaso A. Poggio, et al. Fisher-Rao Metric, Geometry, and Complexity of Neural Networks, 2017, AISTATS.