Training Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are powerful sequence models that were believed to be difficult to train, and as a result they were rarely used in machine learning applications. This thesis presents methods that overcome the difficulty of training RNNs, along with applications of RNNs to challenging problems. We first describe a new probabilistic sequence model that combines Restricted Boltzmann Machines and RNNs. The new model is more powerful than similar models while being easier to train. Next, we present a new variant of the Hessian-free (HF) optimizer and show that it can train RNNs on tasks with extreme long-range temporal dependencies, which were previously considered impossibly hard. We then apply HF to character-level language modelling and obtain excellent results. We also apply HF to optimal control and obtain RNN control laws that operate successfully under delayed feedback and unknown disturbances. Finally, we describe a random parameter initialization scheme that allows gradient descent with momentum to train RNNs on problems with long-term dependencies. This directly contradicts the widespread belief that first-order methods cannot do so, and suggests that previous attempts at training RNNs failed partly because of flaws in the random initialization.
