Learning long-term dependencies in NARX recurrent neural networks

It has previously been shown that gradient-descent learning algorithms for recurrent neural networks can perform poorly on tasks involving long-term dependencies, i.e., problems for which the desired output depends on inputs presented at times far in the past. We show that the long-term dependencies problem is lessened for a class of architectures called nonlinear autoregressive models with exogenous inputs (NARX) recurrent neural networks, which have powerful representational capabilities. We have previously reported that gradient-descent learning can be more effective in NARX networks than in recurrent neural network architectures that have "hidden states" on problems including grammatical inference and nonlinear system identification: typically, the network converges much faster and generalizes better than other networks. The results in this paper are consistent with this phenomenon. We present experimental results showing that NARX networks can often retain information for two to three times as long as conventional recurrent neural networks. We show that although NARX networks do not circumvent the problem of long-term dependencies, they can greatly improve performance on long-term dependency problems. We also describe in detail some of the assumptions regarding what it means to latch information robustly and suggest possible ways to loosen these assumptions.
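As a concrete illustration of the architecture the abstract refers to, the sketch below implements a minimal single-output NARX network in NumPy. It is a hypothetical reconstruction for illustration only: the delay orders n_u and n_y, the single tanh hidden layer, and the random weight initialization are arbitrary choices and are not taken from the paper. The structural point it shows is that the fed-back output delays y(t-1), ..., y(t-n_y) act as shortcut connections through time in the unfolded network, which is the property the abstract connects to improved learning of long-term dependencies.

```python
import numpy as np

# Minimal NARX sketch (illustrative, not the authors' code).
# The next output is a nonlinear function of tapped delay lines of past
# exogenous inputs u and past outputs y:
#     y(t) = f( u(t-1), ..., u(t-n_u), y(t-1), ..., y(t-n_y) )

rng = np.random.default_rng(0)

n_u, n_y, n_hidden = 3, 3, 8                      # delay orders and hidden size (arbitrary)
W_in = rng.normal(scale=0.5, size=(n_hidden, n_u + n_y))
b_in = np.zeros(n_hidden)
W_out = rng.normal(scale=0.5, size=n_hidden)
b_out = 0.0

def narx_step(u_hist, y_hist):
    """One forward step: map the two delay lines to the next output."""
    x = np.concatenate([u_hist, y_hist])          # regressor vector
    h = np.tanh(W_in @ x + b_in)                  # hidden layer
    return np.tanh(W_out @ h + b_out)             # bounded scalar output

def run_narx(u_seq):
    """Drive the network with an input sequence, feeding its outputs back."""
    u_hist = np.zeros(n_u)                        # most recent input first
    y_hist = np.zeros(n_y)                        # most recent output first
    outputs = []
    for u in u_seq:
        u_hist = np.concatenate([[u], u_hist[:-1]])
        y = narx_step(u_hist, y_hist)
        y_hist = np.concatenate([[y], y_hist[:-1]])
        outputs.append(y)
    return np.array(outputs)

print(run_narx(rng.normal(size=20))[:5])
```

Under these assumptions, backpropagating an error signal from y(t) to y(t-n_y) crosses a single delay edge rather than n_y recurrent steps, which is the intuition behind the longer information retention reported in the abstract.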
