Blending LSTMs into CNNs

We consider whether deep convolutional networks (CNNs) can represent decision functions with accuracy similar to that of recurrent networks such as LSTMs. First, we show that a deep CNN with an architecture inspired by models recently introduced for image recognition can yield better accuracy than previous convolutional and LSTM networks on the standard 309-hour Switchboard automatic speech recognition task. Then we show that even more accurate CNNs can be trained under the guidance of LSTMs using a variant of model compression, which we call model blending because the teacher and student models are similar in complexity but different in inductive bias. Blending further improves the accuracy of our CNN, yielding a computationally efficient model with accuracy higher than any of the other individual models. Examining the effect of "dark knowledge" in this model compression task, we find that less than 1% of the highest-probability labels are needed for accurate model compression.
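To make the blending idea concrete, the sketch below is a minimal, hypothetical illustration (written in PyTorch, not the paper's implementation) of a compression-style training objective: the CNN student is trained to match a temperature-softened LSTM teacher distribution that has been truncated to the top-k highest-probability labels per frame, combined with the usual cross-entropy on the hard alignment targets. The names `blending_loss`, `temperature`, `top_k`, and `alpha` are illustrative assumptions, not quantities taken from the paper.

```python
import torch
import torch.nn.functional as F

def blending_loss(student_logits, teacher_logits, hard_labels,
                  temperature=2.0, top_k=40, alpha=0.5):
    """Hypothetical sketch of a model-blending (compression) loss.

    The student (CNN) matches the teacher's (LSTM's) softened output
    distribution, truncated to the top_k highest-probability labels per
    frame, while also fitting the hard forced-alignment labels.
    """
    # Softened teacher distribution ("dark knowledge").
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # Keep only the top_k most probable classes and renormalise.
    topk_vals, topk_idx = teacher_probs.topk(top_k, dim=-1)
    truncated = torch.zeros_like(teacher_probs).scatter_(-1, topk_idx, topk_vals)
    truncated = truncated / truncated.sum(dim=-1, keepdim=True)

    # KL divergence between the student and the truncated teacher distribution.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(student_log_probs, truncated,
                         reduction='batchmean') * temperature ** 2

    # Standard cross-entropy against the hard labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

Under this reading, the abstract's observation that less than 1% of the highest-probability labels suffice corresponds to choosing top_k much smaller than the number of output targets.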
