The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions

Recurrent nets are in principle capable of storing past inputs to produce the currently desired output. Because of this property, recurrent nets are used in time series prediction and process control. Practical applications involve temporal dependencies spanning many time steps, e.g., between relevant inputs and desired outputs. In this case, however, gradient-based learning methods take too much time. The extreme increase in learning time arises because the error vanishes as it gets propagated back. In this article the decaying error flow is theoretically analyzed. Then methods trying to overcome vanishing gradients are briefly discussed. Finally, experiments comparing conventional algorithms and alternative methods are presented. With advanced methods, long time lag problems can be solved in reasonable time.
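The vanishing of the backpropagated error can be made concrete with a small numerical sketch (not from the paper): in a fully recurrent net with sigmoid units, the error vector is repeatedly multiplied by the diagonal matrix of activation derivatives and the transposed recurrent weight matrix, so its norm tends to shrink geometrically over many time steps. The network size, weight scale, and sequence length below are arbitrary choices for illustration.

```python
# Minimal sketch of vanishing error flow in a simple recurrent net.
# Assumed/hypothetical parameters: 10 sigmoid units, 50 time steps,
# Gaussian recurrent weights with standard deviation 0.5.
import numpy as np

rng = np.random.default_rng(0)

n_units = 10   # hidden units
T = 50         # time steps the error is propagated back
W = rng.normal(scale=0.5, size=(n_units, n_units))  # recurrent weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass of the recurrent layer on a random input sequence,
# storing pre-activations for the backward pass.
x = rng.normal(size=(T, n_units))
h = np.zeros(n_units)
pre_activations = []
for t in range(T):
    net = W @ h + x[t]
    pre_activations.append(net)
    h = sigmoid(net)

# Backward pass: propagate an error vector delta from step T-1 back to 0.
# Each step multiplies delta by diag(sigmoid'(net_t)) and W^T; since
# sigmoid' <= 0.25, the norm typically decays geometrically.
delta = rng.normal(size=n_units)
for t in reversed(range(T)):
    s = sigmoid(pre_activations[t])
    delta = W.T @ (s * (1.0 - s) * delta)
    if t % 10 == 0:
        print(f"step {t:3d}: ||delta|| = {np.linalg.norm(delta):.3e}")
```

Running the sketch prints error norms that drop by several orders of magnitude over the 50 steps, which is the effect the article analyzes theoretically.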
