LipNet: Sentence-level Lipreading

Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequence prediction. Studies have shown that human lipreading performance improves for longer words (Easton & Basala, 1982), indicating the importance of features that capture temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, an LSTM recurrent network, and the connectionist temporal classification (CTC) loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first lipreading model to operate at the sentence level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous state-of-the-art accuracy of 79.6%.
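
To make the architecture description concrete, below is a minimal PyTorch sketch of a LipNet-style pipeline: spatiotemporal (3D) convolutions over the frame sequence, a bidirectional LSTM over the resulting per-frame features, and a linear layer producing per-frame character distributions trained with the CTC loss. The layer sizes, the 28-symbol character vocabulary, and the 50×100 mouth-crop resolution are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a LipNet-style model (illustrative hyperparameters).
import torch
import torch.nn as nn


class LipNetSketch(nn.Module):
    def __init__(self, vocab_size=28):  # assumed: 26 letters + space + CTC blank (index 0)
        super().__init__()
        # Spatiotemporal convolutions: 3D kernels span time as well as space,
        # so each feature already encodes short-range mouth motion.
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Bidirectional LSTM aggregates longer-range temporal context.
        self.lstm = nn.LSTM(input_size=64, hidden_size=256,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, vocab_size)

    def forward(self, frames):
        # frames: (batch, channels, time, height, width); the time axis is
        # preserved by the convolutions (stride 1, padding 1 along time).
        feats = self.stcnn(frames)            # (B, 64, T, H', W')
        feats = feats.mean(dim=(3, 4))        # pool out space -> (B, 64, T)
        feats = feats.transpose(1, 2)         # (B, T, 64)
        out, _ = self.lstm(feats)             # (B, T, 512)
        return self.fc(out).log_softmax(-1)   # per-frame log-probs over characters


# One CTC training step on dummy data: the loss aligns the 75 per-frame
# predictions with a shorter character transcription, entirely end-to-end.
model = LipNetSketch()
video = torch.randn(2, 3, 75, 50, 100)                    # 2 clips of 75 frames
log_probs = model(video).permute(1, 0, 2)                 # CTCLoss wants (T, B, vocab)
targets = torch.randint(1, 28, (2, 30))                   # dummy character indices
input_lens = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((2,), targets.size(1), dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)
loss.backward()
```

A real system would additionally need a mouth-region extraction step (e.g., via facial landmark detection) before the network and a beam-search CTC decoder at test time; both are omitted from this sketch.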

References

[1] Rob Fergus, et al. Visualizing and Understanding Convolutional Networks, 2013, ECCV.

[2] Jürgen Schmidhuber, et al. Framewise phoneme classification with bidirectional LSTM and other neural network architectures, 2005, Neural Networks.

[3] Johan A. du Preez, et al. Audio-Visual Speech Recognition using SciPy, 2010.

[4] Shuicheng Yan, et al. Classification and Feature Extraction by Simplexization, 2008, IEEE Transactions on Information Forensics and Security.

[5] Emmanuel Ferragne, et al. Formant frequencies of vowels in 13 accents of the British Isles, 2010, Journal of the International Phonetic Association.

[6] Stephen J. Cox, et al. Improved speaker independent lip reading using speaker adaptive training and deep neural networks, 2016, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[8] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.

[9] Matti Pietikäinen, et al. A review of recent advances in visual speech decoding, 2014, Image and Vision Computing.

[10] Petros Maragos, et al. Adaptive multimodal fusion by uncertainty compensation, 2006, INTERSPEECH.

[11] F. Deland, et al. The story of lip-reading: its genesis and development, 1968.

[12] Fei-Fei Li, et al. Large-Scale Video Classification with Convolutional Neural Networks, 2014, IEEE Conference on Computer Vision and Pattern Recognition.

[13] Tetsuya Takiguchi, et al. Audio-Visual Speech Recognition Using Bimodal-Trained Bottleneck Features for a Person with Severe Hearing Loss, 2016, INTERSPEECH.

[14] H. McGurk, et al. Hearing lips and seeing voices, 1976, Nature.

[15] Thomas Brox, et al. Striving for Simplicity: The All Convolutional Net, 2014, ICLR.

[16] Timothy F. Cootes, et al. Extraction of Visual Features for Lipreading, 2002, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17] Jon Barker, et al. An audio-visual corpus for speech perception and automatic speech recognition, 2006, The Journal of the Acoustical Society of America.

[18] Daniel Jurafsky, et al. Lexicon-Free Conversational Speech Recognition with Neural Networks, 2015, NAACL.

[19] Sridha Sridharan, et al. Patch-Based Representation of Visual Speech, 2006.

[20] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Communications of the ACM.

[21] Petros Maragos, et al. Multimodal Fusion and Learning with Uncertain Features Applied to Audiovisual Speech Recognition, 2007, IEEE 9th Workshop on Multimedia Signal Processing.

[22] Joon Son Chung, et al. Out of Time: Automated Lip Sync in the Wild, 2016, ACCV Workshops.

[23] Amit Garg, et al. Lip reading using CNN and LSTM, 2016.

[24] Barry-John Theobald, et al. Comparison of human and machine-based lip-reading, 2009, AVSP.

[25] Hermann Ney, et al. Deep Learning of Mouth Shapes for Sign Language, 2015, IEEE International Conference on Computer Vision Workshop (ICCVW).

[26] Stefanos Zafeiriou, et al. 300 Faces in-the-Wild Challenge: The First Facial Landmark Localization Challenge, 2013, IEEE International Conference on Computer Vision Workshops.

[27] Navdeep Jaitly, et al. Towards End-To-End Speech Recognition with Recurrent Neural Networks, 2014, ICML.

[28] M. Woodward, et al. Phoneme perception in lipreading, 1960, Journal of Speech and Hearing Research.

[29] Jürgen Schmidhuber, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, 2006, ICML.

[30] R. D. Easton, et al. Perceptual dominance during lipreading, 1982, Perception & Psychophysics.

[31] Jean-Philippe Thiran, et al. Information Theoretic Feature Extraction for Audio-Visual Speech Recognition, 2009, IEEE Transactions on Signal Processing.

[32] Tara N. Sainath, et al. Fundamental technologies in modern speech recognition, 2012, IEEE Signal Processing Magazine, doi: 10.1109/MSP.2012.2205597.

[33] C. G. Fisher, et al. Confusions among visually perceived consonants, 1968, Journal of Speech and Hearing Research.

[34] Dong Yu, et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[35] Davis E. King, et al. Dlib-ml: A Machine Learning Toolkit, 2009, Journal of Machine Learning Research.

[36] Ming Yang, et al. 3D Convolutional Neural Networks for Human Action Recognition, 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37] Andrew Zisserman, et al. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, 2013, ICLR.

[38] Chong Wang, et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, 2015, ICML.

[39] Juhan Nam, et al. Multimodal Deep Learning, 2011, ICML.

[40] Matti Pietikäinen, et al. Lipreading with Local Spatiotemporal Descriptors, 2009, IEEE Transactions on Multimedia.

[41] A. Cruttenden. Gimson's Pronunciation of English, 1994.

[42] Michael Wand, et al. Lipreading with long short-term memory, 2016, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[43] Joon Son Chung, et al. Lip Reading in the Wild, 2016, ACCV.

[44] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[45] Tetsuya Ogata, et al. Lipreading using convolutional neural network, 2014, INTERSPEECH.

[46] Petros Maragos, et al. Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition, 2009, IEEE Transactions on Audio, Speech, and Language Processing.