On the Convergence of the LMS Algorithm with Adaptive Learning Rate for Linear Feedforward Networks

We consider the problem of training a linear feedforward neural network with a gradient-descent-like LMS learning algorithm. The objective is to find a weight matrix for the network, by repeatedly presenting a finite set of examples to it, so that the sum of the squares of the errors is minimized. Kohonen showed that with a small but fixed learning rate (or step size), some subsequences of the weight matrices generated by the algorithm converge to certain matrices close to the optimal weight matrix. In this paper, we show that, by dynamically decreasing the learning rate during each training cycle, the sequence of matrices generated by the algorithm converges to the optimal weight matrix. We also show that for any given ∊ > 0, the LMS algorithm with decreasing learning rates generates an ∊-optimal weight matrix (i.e., a matrix within distance ∊ of the optimal matrix) after O(1/∊) training cycles. This is in contrast to the O((1/∊) log(1/∊)) training cycles needed to generate an ∊-optimal weight matrix when the learning rate is kept fixed. We also give a general condition on the learning rates under which the LMS learning algorithm is guaranteed to converge to the optimal weight matrix.
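To make the training scheme concrete, the following is a minimal sketch (not the paper's construction) of cyclic LMS training of a linear network in which the learning rate is decreased from one training cycle to the next. The schedule eta0/(cycle + 1), the function name lms_train, and all parameter values are illustrative assumptions; the paper's actual condition on the learning rates is stated in the body of the work.

    # Minimal sketch of cyclic LMS training with a decreasing learning rate.
    # The 1/(cycle + 1) schedule is an assumed example, not the paper's exact condition.
    import numpy as np

    def lms_train(X, Y, n_cycles=100, eta0=0.1):
        """Train W so that W @ x approximates y for each example (x, y).

        X: (n_examples, n_in) inputs, Y: (n_examples, n_out) targets.
        Returns the weight matrix W of shape (n_out, n_in).
        """
        n_in = X.shape[1]
        n_out = Y.shape[1]
        W = np.zeros((n_out, n_in))
        for cycle in range(n_cycles):
            eta = eta0 / (cycle + 1)          # learning rate decreased each cycle (assumed schedule)
            for x, y in zip(X, Y):            # one presentation of the whole example set
                err = y - W @ x               # instantaneous error for this example
                W += eta * np.outer(err, x)   # LMS (Widrow-Hoff) update
        return W

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.standard_normal((50, 4))
        W_true = rng.standard_normal((3, 4))
        Y = X @ W_true.T                      # noiseless targets from a known linear map
        W = lms_train(X, Y, n_cycles=200, eta0=0.05)
        print("max abs deviation from the target matrix:", np.abs(W - W_true).max())

With noiseless, realizable data as in this toy example, the iterates approach the least-squares-optimal weight matrix as the number of cycles grows; the point of the sketch is only the structure of the update, i.e., one learning-rate value per training cycle rather than a single fixed step size.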

[1] J. Sacks, Asymptotic Distribution of Stochastic Approximation Procedures, 1958.

[2] V. Fabian, On Asymptotic Normality in Stochastic Approximation, 1968.

[3] Teuvo Kohonen, et al., An Adaptive Associative Memory Principle, IEEE Transactions on Computers, 1974.

[4] Harold J. Kushner, et al., Stochastic Approximation Methods for Constrained and Unconstrained Systems, 1978.

[5] Lennart Ljung, et al., Recursive Identification Methods for Off-Line Identification Problems, 1982.

[6] Gene H. Golub, et al., Matrix Computations, 1983.

[7] Geoffrey E. Hinton, et al., Learning Internal Representations by Error Propagation, 1986.

[8] James L. McClelland, et al., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, 1986.

[9] Terrence J. Sejnowski, et al., Parallel Networks that Learn to Pronounce English Text, Complex Systems, 1987.

[10] Teuvo Kohonen, et al., Self-Organization and Associative Memory, 1988.

[11] Bernard Widrow, et al., Adaptive Switching Circuits, 1988.

[12] Terrence J. Sejnowski, et al., Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets, Neural Networks, 1988.

[13] Robert A. Jacobs, et al., Increased Rates of Convergence Through Learning Rate Adaptation, Neural Networks, 1987.

[14] George Yin, et al., Adaptive Filters with Constraints and Correlated Non-Stationary Signals, 1988.

[15] H. White, Some Asymptotic Results for Learning in Single Hidden-Layer Feedforward Network Models, 1989.

[16] Yu He, et al., Asymptotic Convergence of Backpropagation, Neural Computation, 1989.

[17] John N. Tsitsiklis, et al., Parallel and Distributed Computation, 1989.

[18] Roberto Battiti, First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method, Neural Computation, 1992.