Long Short-Term Memory

Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O. 1. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

[1]  Pineda Generalization of back-propagation to recurrent neural networks. , 1987, Physical review letters.

[2]  PAUL J. WERBOS,et al.  Generalization of backpropagation with application to a recurrent gas market model , 1988, Neural Networks.

[3]  Fernando J. Pineda,et al.  Dynamics and architecture for neural computation , 1988, J. Complex..

[4]  Barak A. Pearlmutter Learning State Space Trajectories in Recurrent Neural Networks , 1989, Neural Computation.

[5]  Jürgen Schmidhuber,et al.  A Local Learning Algorithm for Dynamic Feedforward and Recurrent Networks , 1989 .

[6]  David Zipser,et al.  Learning Sequential Structure with the Real-Time Recurrent Learning Algorithm , 1991, Int. J. Neural Syst..

[7]  Ronald J. Williams,et al.  Gradient-Based Learning Algorithms for Recurrent Networks , 1989 .

[8]  James L. McClelland,et al.  Finite State Automata and Simple Recurrent Networks , 1989, Neural Computation.

[9]  Kevin J. Lang A time delay neural network architecture for speech recognition , 1989 .

[10]  Kenji Doya,et al.  Adaptive neural oscillator using continuous-time back-propagation learning , 1989, Neural Networks.

[11]  Jing Peng,et al.  An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories , 1990, Neural Computation.

[12]  Jordan B. Pollack,et al.  Language Induction by Phase Transition in Dynamical Recognizers , 1990, NIPS.

[13]  Jeffrey L. Elman,et al.  Finding Structure in Time , 1990, Cogn. Sci..

[14]  L. B. Almeida A learning rule for asynchronous perceptrons with feedback in a combinatorial environment , 1990 .

[15]  Scott E. Fahlman,et al.  The Recurrent Cascade-Correlation Architecture , 1990, NIPS.

[16]  José Carlos Príncipe,et al.  A Theory for Neural Networks with Time Delays , 1990, NIPS.

[17]  Geoffrey E. Hinton,et al.  A time-delay neural network architecture for isolated word recognition , 1990, Neural Networks.

[18]  Jürgen Schmidhuber Learning Unambiguous Reduced Sequence Descriptions , 1991, NIPS.

[19]  Sepp Hochreiter,et al.  Untersuchungen zu dynamischen neuronalen Netzen , 1991 .

[20]  Pierre Baldi,et al.  Contrastive Learning and Neural Oscillations , 1991, Neural Computation.

[21]  Michael C. Mozer,et al.  Induction of Multiscale Temporal Structure , 1991, NIPS.

[22]  Jürgen Schmidhuber,et al.  Learning Complex, Extended Sequences Using the Principle of History Compression , 1992, Neural Computation.

[23]  Jürgen Schmidhuber,et al.  A Fixed Size Storage O(n3) Time Complexity Learning Algorithm for Fully Recurrent Continually Running Networks , 1992, Neural Computation.

[24]  Raymond L. Watrous,et al.  Induction of Finite-State Languages Using Second-Order Recurrent Networks , 1992, Neural Computation.

[25]  Guo-Zheng Sun,et al.  Time Warping Invariant Neural Networks , 1992, NIPS.

[26]  Mark B. Ring Learning Sequential Tasks by Incrementally Adding Higher Orders , 1992, NIPS.

[27]  Tony Plate,et al.  Holographic Recurrent Networks , 1992, NIPS.

[28]  K. Doya,et al.  Bifurcations in the learning of recurrent neural networks , 1992, [Proceedings] 1992 IEEE International Symposium on Circuits and Systems.

[29]  Yoshua Bengio,et al.  Credit Assignment through Time: Alternatives to Backpropagation , 1993, NIPS.

[30]  C. Lee Giles,et al.  Experimental Comparison of the Effect of Order in Recurrent Neural Networks , 1993, Int. J. Pattern Recognit. Artif. Intell..

[31]  Jürgen Schmidhuber,et al.  Continuous history compression , 1993 .

[32]  Jürgen Schmidhuber,et al.  Netzwerkarchitekturen, Zielfunktionen und Kettenregel , 1993 .

[33]  Lee A. Feldkamp,et al.  Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks , 1994, IEEE Trans. Neural Networks.

[34]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[35]  Peter Tiňo,et al.  Learning long-term dependencies is not as difficult with NARX recurrent neural networks , 1995 .

[36]  Ronald J. Williams,et al.  Gradient-based learning algorithms for recurrent networks and their computational complexity , 1995 .

[37]  Barak A. Pearlmutter Gradient calculations for dynamic recurrent neural networks: a survey , 1995, IEEE Trans. Neural Networks.

[38]  Corso Elvezia Bridging Long Time Lags by Weight Guessing and \long Short Term Memory" , 1996 .

[39]  Sepp Hochreiter,et al.  Guessing can Outperform Many Long Time Lag Algorithms , 1996 .

[40]  Jürgen Schmidhuber,et al.  LSTM can Solve Hard Long Time Lag Problems , 1996, NIPS.

[41]  Peter Tiño,et al.  Learning long-term dependencies in NARX recurrent neural networks , 1996, IEEE Trans. Neural Networks.