L1-regularized Neural Networks are Improperly Learnable in Polynomial Time

We study the improper learning of multi-layer neural networks. Suppose that the neural network to be learned has k hidden layers and that the l1-norm of the incoming weights of any neuron is bounded by L. We present a kernel-based method that, with probability at least 1 − δ, learns a predictor whose generalization error is at most ε worse than that of the neural network. The sample complexity and the time complexity of the method are polynomial in the input dimension and in (1/ε, log(1/δ), F(k, L)), where F(k, L) is a function that depends on (k, L) and on the activation function, but is independent of the number of neurons. The algorithm applies to both sigmoid-like and ReLU-like activation functions. The result implies that any sufficiently sparse neural network is learnable in polynomial time.
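To make the setting concrete, the sketch below illustrates the quantities in the statement: a small feed-forward network whose per-neuron incoming weights are scaled to satisfy the l1 bound L plays the role of the target, and a plain kernel ridge regressor with an off-the-shelf RBF kernel stands in for the kernel-based improper learner. The architecture, the kernel, the regularization parameter, and the synthetic data are illustrative assumptions only, not the construction analyzed in the paper.

```python
# Minimal sketch (assumptions labeled): a 2-hidden-layer sigmoid network with
# per-neuron l1-bounded incoming weights is the "target"; kernel ridge
# regression with an RBF kernel is a generic stand-in for the paper's
# kernel-based improper learner, not the paper's actual kernel.
import numpy as np

rng = np.random.default_rng(0)


def l1_bounded_layer(n_in, n_out, L, rng):
    """Random weight matrix whose rows (incoming weights of each neuron)
    have l1-norm at most L."""
    W = rng.normal(size=(n_out, n_in))
    W *= L / np.maximum(np.abs(W).sum(axis=1, keepdims=True), 1e-12)
    return W


def target_network(X, layers):
    """k-hidden-layer network with sigmoid activations; the scalar output
    (average of the last hidden layer) is an illustrative choice."""
    H = X
    for W in layers:
        H = 1.0 / (1.0 + np.exp(-H @ W.T))
    return H.mean(axis=1)


def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)


# Data labeled by a sparse (l1-bounded) two-hidden-layer target network.
d, n, L = 10, 200, 2.0
layers = [l1_bounded_layer(d, 8, L, rng), l1_bounded_layer(8, 4, L, rng)]
X_train, X_test = rng.normal(size=(n, d)), rng.normal(size=(50, d))
y_train, y_test = target_network(X_train, layers), target_network(X_test, layers)

# Improper learner: kernel ridge regression fit in closed form.
lam = 1e-2
K = rbf_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(n), y_train)
y_pred = rbf_kernel(X_test, X_train) @ alpha

print("test MSE of kernel predictor:", np.mean((y_pred - y_test) ** 2))
```

The point of the sketch is only that the learned predictor is a kernel expansion rather than a neural network (hence "improper"), and that its training reduces to a convex problem; the guarantees in the abstract concern the paper's specific kernel construction, which this placeholder does not reproduce.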
