Unsupervised Learning in LSTM Recurrent Neural Networks

While much work has been done on unsupervised learning in feedforward neural network architectures, its potential with (theoretically more powerful) recurrent networks and time-varying inputs has rarely been explored. Here we train Long Short-Term Memory (LSTM) recurrent networks to maximize two information-theoretic objectives for unsupervised learning: Binary Information Gain Optimization (BINGO) and Nonparametric Entropy Optimization (NEO). LSTM learns to discriminate different types of temporal sequences and group them according to a variety of features.
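For concreteness, below is a minimal sketch of the two objectives, assuming a single logistic output unit for BINGO and a one-dimensional batch of network outputs for NEO. The function names, the running-average prior `y_bar`, and the kernel width `sigma` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def bingo_step(w, x, y_bar, lr=0.1, decay=0.9):
    """One gradient-ascent step of Binary Information Gain Optimization
    (BINGO) for a single logistic unit: the output is pushed away from a
    running estimate of its own prior, so that the binarized output
    conveys maximal information about the input."""
    net = np.dot(w, x)
    y = 1.0 / (1.0 + np.exp(-net))          # logistic activation
    # One common form of the BINGO update: f'(net) * (y - y_bar) * x,
    # where y_bar estimates the unit's expected output.
    grad = y * (1.0 - y) * (y - y_bar) * x
    w = w + lr * grad                        # ascend the information gain
    y_bar = decay * y_bar + (1.0 - decay) * y
    return w, y_bar

def neo_entropy(sample, sigma=0.5):
    """Leave-one-out Parzen-window entropy estimate of the kind used by
    Nonparametric Entropy Optimization (NEO): entropy is approximated as
    the negative mean log-likelihood of each output under a Gaussian
    kernel density fit to the remaining outputs.
    sample: 1-D np.ndarray of network outputs (len >= 2)."""
    n = len(sample)
    diffs = sample[:, None] - sample[None, :]          # pairwise differences
    k = np.exp(-0.5 * (diffs / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    np.fill_diagonal(k, 0.0)                           # leave one out
    p = k.sum(axis=1) / (n - 1)                        # density at each point
    return -np.log(p).mean()
```

In the paper these objectives are computed on the LSTM's outputs and their gradients are propagated back through the recurrent network; the sketch above shows only the objectives themselves at the output layer.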
