Gradient Descent with Identity Initialization Efficiently Learns Positive-Definite Linear Transformations by Deep Residual Networks

We analyze algorithms for approximating a function $f(x) = \Phi x$ mapping $\mathbb{R}^d$ to $\mathbb{R}^d$ using deep linear neural networks, that is, that learn a function $h$ parameterized by matrices $\Theta_1, \ldots, \Theta_L$ and defined by $h(x) = \Theta_L \Theta_{L-1} \cdots \Theta_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least-squares matrix $\Phi$, in the case where the initial hypothesis $\Theta_1 = \cdots = \Theta_L = I$ has excess loss bounded by a small enough constant. We also show that gradient descent fails to converge for $\Phi$ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If $\Phi$ is symmetric positive definite, we show that an algorithm that initializes $\Theta_i = I$ learns an $\varepsilon$-approximation of $f$ using a number of updates polynomial in $L$, the condition number of $\Phi$, and $\log(d/\varepsilon)$. In contrast, we show that if the least-squares matrix $\Phi$ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that $\Phi$ satisfies $u^\top \Phi u > 0$ for all $u$ but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant $u^\top \Theta_L \Theta_{L-1} \cdots \Theta_1 u > 0$ for all $u$, and the other that "balances" $\Theta_1, \ldots, \Theta_L$ so that they have the same singular values.
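To make the training setup concrete, below is a minimal NumPy sketch of the symmetric positive-definite case: gradient descent from the identity initialization on the population quadratic loss, which under isotropic inputs reduces (up to an additive constant) to $\frac{1}{2}\|\Theta_L \cdots \Theta_1 - \Phi\|_F^2$. The dimension, depth, target matrix, step size, and iteration count are illustrative assumptions, not the settings analyzed in the paper, and the sketch omits the regularizers used in the non-symmetric case.

```python
import numpy as np

# Minimal sketch (not the paper's exact parameter settings) of gradient
# descent with identity initialization for a deep linear network
# h(x) = Theta_L ... Theta_1 x. Under isotropic inputs, the population
# quadratic loss equals (1/2) * ||Theta_L ... Theta_1 - Phi||_F^2 up to an
# additive constant, so its gradient is available in closed form.
# The dimension d, depth L, target Phi, step size, and iteration count
# below are illustrative assumptions.

d, L = 5, 10
rng = np.random.default_rng(0)

# Symmetric positive-definite target Phi with modest condition number
# (the regime in which convergence from the identity is shown).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Phi = Q @ np.diag(np.linspace(0.5, 2.0, d)) @ Q.T

Theta = [np.eye(d) for _ in range(L)]   # identity initialization
eta = 0.01                              # step size (illustrative)

def product(mats):
    """Return mats[-1] @ ... @ mats[0], i.e. Theta_L ... Theta_1."""
    out = np.eye(d)
    for M in mats:
        out = M @ out
    return out

for _ in range(1000):
    A = product(Theta)
    E = A - Phi                         # residual of the end-to-end map
    # Gradient of (1/2) * ||A - Phi||_F^2 with respect to Theta_i is
    # (Theta_L ... Theta_{i+1})^T  E  (Theta_{i-1} ... Theta_1)^T.
    grads = [product(Theta[i + 1:]).T @ E @ product(Theta[:i]).T
             for i in range(L)]
    for i in range(L):
        Theta[i] -= eta * grads[i]

print("excess loss:", 0.5 * np.linalg.norm(product(Theta) - Phi) ** 2)
```

With this well-conditioned target the excess loss drops to numerical noise well within the 1000 iterations, consistent with the polynomial iteration bounds stated above.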
