Nonparametric Weight Initialization of Neural Networks via Integral Representation

A new initialization method for the hidden parameters of a neural network is proposed. A nonparametric probability distribution over hidden parameters is derived from the integral representation of the neural network. In the proposed method, hidden parameters are initialized with samples drawn from this distribution, and output parameters are then fitted by ordinary linear regression. Numerical experiments show that backpropagation with the proposed initialization converges faster than with uniformly random initialization. It is also shown that, in some cases, the proposed method achieves sufficient accuracy by itself, without any backpropagation.
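The overall pipeline described above, sampling hidden parameters and then fitting output weights in closed form by least squares, can be sketched as follows. This is a minimal illustration only: `sample_hidden` stands in for the paper's nonparametric, data-dependent distribution (derived from the integral representation and not reproduced here), and the Gaussian placeholder sampler, the `tanh` activation, and all names are assumptions for the sake of a runnable example.

```python
import numpy as np

def init_and_fit(X, y, n_hidden, sample_hidden, seed=None):
    """Initialize hidden parameters by sampling, then fit output
    weights by ordinary linear regression (least squares).

    `sample_hidden` is a stand-in for the paper's nonparametric
    sampler; it must return weights W of shape (n_hidden, d) and
    biases b of shape (n_hidden,).
    """
    rng = np.random.default_rng(seed)
    W, b = sample_hidden(X, y, n_hidden, rng)
    H = np.tanh(X @ W.T + b)                      # hidden-layer activations
    c, *_ = np.linalg.lstsq(H, y, rcond=None)     # output weights in closed form
    return W, b, c

def gaussian_sampler(X, y, n_hidden, rng):
    """Placeholder: i.i.d. Gaussian hidden parameters. The paper instead
    draws them from a distribution constructed from the training data."""
    d = X.shape[1]
    return rng.standard_normal((n_hidden, d)), rng.standard_normal(n_hidden)

# Usage on toy data: approximate f(x) = sin(x) with 50 hidden units.
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
y = np.sin(X).ravel()
W, b, c = init_and_fit(X, y, 50, gaussian_sampler, seed=0)
pred = np.tanh(X @ W.T + b) @ c
print("RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
```

The fitted network can then either be used as-is or serve as the starting point for backpropagation, which is where the reported speedup over uniformly random initialization is observed.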
