Scaled stochastic methods for training neural networks

The performance surfaces of large neural networks contain ravines, "flat spots," non-convex regions, and other features that make weight optimization difficult. Although a variety of sophisticated alternatives are available, the simple on-line backpropagation procedure remains the most popular method for adapting the weights of these systems. This approach performs stochastic (or incremental) steepest descent and is significantly hampered by the character of the performance surface. Backpropagation's principal advantage over alternative methods lies in its ability to perform an update after each pattern presentation while maintaining time and space demands that grow only linearly with the number of adaptive weights.

In this dissertation, we explore new stochastic methods that improve on the learning speed of the backpropagation algorithm while retaining its linear complexity. We begin by examining the convergence properties of two deterministic steepest descent methods. Corresponding scaled stochastic algorithms are then developed from an analysis of the neural network's Expected Mean Square Error (EMSE) sequence in the neighborhood of a local minimum of the performance surface. To maintain stable behavior under broad conditions, this development uses a general statistical model for the neural network's instantaneous Hessian matrix. Theoretical performance comparisons, however, require a more specialized statistical framework. The resulting analysis reveals the complementary convergence properties of the two updates, a relationship we exploit by combining them to form a family of dual-update procedures.

Effective methods are established for generating a slowly varying sequence of search direction vectors and all required scaling information. The result is a practical algorithm that performs robustly when the weight vector of a large neural network is placed at an arbitrary initial position. The two weight updates are scaled by parameters computed from recursive estimates of five scalar sequences: the first and second moments of the trace of the instantaneous Hessian matrix, the first and second moments of the instantaneous gradient vector's projection along the search direction, and the first moment of the instantaneous Hessian's "projection" along the same direction.
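The on-line backpropagation procedure referred to above is stochastic steepest descent: the weights move opposite the instantaneous gradient after every pattern presentation, at a cost linear in the number of weights. The following minimal sketch illustrates that per-pattern update for a one-hidden-layer network; the layer sizes, tanh nonlinearity, squared-error cost, learning rate `mu`, and toy data are illustrative assumptions, not details taken from the dissertation.

```python
import numpy as np

# Minimal on-line backpropagation: one steepest-descent update per pattern,
# with time and space costs linear in the number of adaptive weights.
# Sizes, learning rate, and data are illustrative assumptions.
rng = np.random.default_rng(0)
n_in, n_hid, n_out, mu = 4, 8, 1, 0.05
W1 = rng.normal(scale=0.5, size=(n_hid, n_in))   # input-to-hidden weights
W2 = rng.normal(scale=0.5, size=(n_out, n_hid))  # hidden-to-output weights

def backprop_step(x, t):
    """One pattern presentation: forward pass, instantaneous gradient, update."""
    global W1, W2
    h = np.tanh(W1 @ x)            # hidden activations
    y = W2 @ h                     # linear output layer
    e = y - t                      # instantaneous error
    g_W2 = np.outer(e, h)                          # dE/dW2
    g_W1 = np.outer((W2.T @ e) * (1 - h**2), x)    # dE/dW1 (chain rule)
    W2 -= mu * g_W2                # steepest-descent step on this pattern
    W1 -= mu * g_W1
    return 0.5 * float(e @ e)      # instantaneous squared error

# Toy regression task (illustrative only).
X = rng.normal(size=(200, n_in))
T = np.sin(X.sum(axis=1, keepdims=True))
for epoch in range(20):
    for x, t in zip(X, T):
        backprop_step(x, t)
```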
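The abstract's final sentence describes the bookkeeping behind the scaled dual update: five scalar sequences tracked by recursive estimators. The sketch below illustrates only that bookkeeping and is not the dissertation's algorithm. It assumes exponentially weighted averages (decay `beta`) for the recursive estimates, a fixed unit search direction `d`, and a single linear unit for which the instantaneous Hessian is x xᵀ (so its trace and its projection along `d` are cheap to evaluate). The step sizes `mu_grad` and `mu_dir` formed from the estimates are placeholders; the dissertation derives the actual scaling from the EMSE analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 6, 0.99          # weight count and EWMA decay factor (assumptions)
w = np.zeros(n)            # weight vector
d = rng.normal(size=n)
d /= np.linalg.norm(d)     # unit search direction (held fixed here for brevity)

# Recursive (exponentially weighted) estimates of the five scalar sequences
# named in the abstract; initial values are arbitrary assumptions.
m_trH, m_trH2 = 1.0, 1.0   # 1st and 2nd moments of tr(H_t)
m_gd, m_gd2 = 0.0, 1.0     # 1st and 2nd moments of g_t . d
m_dHd = 1.0                # 1st moment of d^T H_t d

def ewma(old, new):
    """Exponentially weighted recursive average (an assumed estimator)."""
    return beta * old + (1.0 - beta) * new

for _ in range(1000):
    # Toy linear unit: instantaneous Hessian is H_t = x x^T (simplifying assumption).
    x = rng.normal(size=n)
    y = np.sin(x[0])                       # illustrative target
    e = w @ x - y
    g = e * x                              # instantaneous gradient
    trH = x @ x                            # tr(H_t)
    gd = g @ d                             # gradient projection along d
    dHd = (d @ x) ** 2                     # d^T H_t d

    # Update the five recursive estimates; the second moments are tracked here
    # only to show the recursion (they enter the scaling derived in the text).
    m_trH, m_trH2 = ewma(m_trH, trH), ewma(m_trH2, trH**2)
    m_gd, m_gd2 = ewma(m_gd, gd), ewma(m_gd2, gd**2)
    m_dHd = ewma(m_dHd, dHd)

    # Placeholder scalings (NOT the dissertation's formulas): damp the gradient
    # step by total estimated curvature and the directional step by curvature along d.
    mu_grad = 1.0 / (1.0 + m_trH)
    mu_dir = m_gd / (1.0 + m_dHd)

    # Dual update: a scaled gradient step plus a scaled step along the search direction.
    w = w - mu_grad * g - mu_dir * d
```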