Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

We study the finite-sample expressivity, i.e., memorization power, of ReLU networks. Recent results require $N$ hidden nodes to memorize/interpolate $N$ arbitrary data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with $\Omega(\sqrt{N})$ hidden nodes can perfectly memorize most datasets of $N$ points. We also prove that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, establishing tight bounds on memorization capacity. The sufficiency result extends to deeper networks: we show that an $L$-layer network with $W$ parameters in the hidden layers can memorize $N$ data points if $W = \Omega(N)$. Combined with a recent upper bound of $O(WL\log W)$ on the VC dimension, our construction is nearly tight for any fixed $L$. We then analyze the memorization capacity of residual networks under a general-position assumption, proving results that substantially reduce the known requirement of $N$ hidden nodes. Finally, we study the dynamics of stochastic gradient descent (SGD) and show that, when initialized near a memorizing global minimum of the empirical risk, SGD quickly finds a nearby point with much smaller empirical risk.
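As an informal illustration of the sub-linear-width regime, the sketch below trains a ReLU network with two hidden layers of width on the order of $\sqrt{N}$ (one common reading of "3-layer") to fit $N$ random points, and checks that the training loss can be driven close to zero. This is only an empirical sanity check under assumed choices (PyTorch, the Adam optimizer, and the particular $N$, $d$, width constant, and step count), not the paper's explicit construction or proof.

```python
# Minimal empirical sketch (not the paper's construction): a ReLU network
# with two hidden layers of width ~ sqrt(N) is trained to memorize N random
# labeled points. All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

N, d = 400, 10                       # N data points in d dimensions
width = int(2 * N ** 0.5)            # hidden width on the order of sqrt(N)

X = torch.randn(N, d)                # random inputs (in general position a.s.)
y = torch.randn(N, 1)                # arbitrary real labels to memorize

model = nn.Sequential(               # two hidden ReLU layers, scalar output
    nn.Linear(d, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, 1),
)

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

# If the width suffices for memorization, the training MSE should be tiny.
print(f"final training MSE: {loss.item():.2e}")
```

Note that the total parameter count here is roughly $d\sqrt{N} + N$, i.e., $\Theta(N)$ for fixed $d$, which is consistent with the $W = \Omega(N)$ sufficiency condition stated in the abstract.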
