LipNet: Sentence-level Lipreading

Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequence prediction. Studies have shown that human lipreading performance improves for longer words (Easton & Basala, 1982), indicating the importance of features that capture temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, an LSTM recurrent network, and the connectionist temporal classification (CTC) loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first lipreading model to operate at the sentence level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous state-of-the-art accuracy of 79.6%.
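
To make the architecture description concrete, below is a minimal PyTorch sketch of a LipNet-style pipeline: spatiotemporal (3D) convolutions over the frame sequence, a bidirectional LSTM over the resulting per-frame features, and a linear layer producing per-frame character distributions trained with the CTC loss. The layer sizes, the 28-symbol character vocabulary, and the 50×100 mouth-crop resolution are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a LipNet-style model (illustrative hyperparameters).
import torch
import torch.nn as nn


class LipNetSketch(nn.Module):
    def __init__(self, vocab_size=28):  # assumed: 26 letters + space + CTC blank (index 0)
        super().__init__()
        # Spatiotemporal convolutions: 3D kernels span time as well as space,
        # so each feature already encodes short-range mouth motion.
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Bidirectional LSTM aggregates longer-range temporal context.
        self.lstm = nn.LSTM(input_size=64, hidden_size=256,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, vocab_size)

    def forward(self, frames):
        # frames: (batch, channels, time, height, width); the time axis is
        # preserved by the convolutions (stride 1, padding 1 along time).
        feats = self.stcnn(frames)            # (B, 64, T, H', W')
        feats = feats.mean(dim=(3, 4))        # pool out space -> (B, 64, T)
        feats = feats.transpose(1, 2)         # (B, T, 64)
        out, _ = self.lstm(feats)             # (B, T, 512)
        return self.fc(out).log_softmax(-1)   # per-frame log-probs over characters


# One CTC training step on dummy data: the loss aligns the 75 per-frame
# predictions with a shorter character transcription, entirely end-to-end.
model = LipNetSketch()
video = torch.randn(2, 3, 75, 50, 100)                    # 2 clips of 75 frames
log_probs = model(video).permute(1, 0, 2)                 # CTCLoss wants (T, B, vocab)
targets = torch.randint(1, 28, (2, 30))                   # dummy character indices
input_lens = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((2,), targets.size(1), dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)
loss.backward()
```

A real system would additionally need a mouth-region extraction step (e.g., via facial landmark detection) before the network and a beam-search CTC decoder at test time; both are omitted from this sketch.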

References

[1] Rob Fergus, et al. Visualizing and Understanding Convolutional Networks, 2013, ECCV.

[2] Jürgen Schmidhuber, et al. Framewise phoneme classification with bidirectional LSTM and other neural network architectures, 2005, Neural Networks.

[3] Johan A. du Preez, et al. Audio-Visual Speech Recognition using SciPy, 2010.

[4] Shuicheng Yan, et al. Classification and Feature Extraction by Simplexization, 2008, IEEE Transactions on Information Forensics and Security.

[5] Emmanuel Ferragne, et al. Formant frequencies of vowels in 13 accents of the British Isles, 2010, Journal of the International Phonetic Association.

[6] Stephen J. Cox, et al. Improved speaker independent lip reading using speaker adaptive training and deep neural networks, 2016, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[8] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.

[9] Matti Pietikäinen, et al. A review of recent advances in visual speech decoding, 2014, Image and Vision Computing.

[10] Petros Maragos, et al. Adaptive multimodal fusion by uncertainty compensation, 2006, INTERSPEECH.

[11] F. Deland, et al. The story of lip-reading: its genesis and development, 1968.

[12] Fei-Fei Li, et al. Large-Scale Video Classification with Convolutional Neural Networks, 2014, IEEE Conference on Computer Vision and Pattern Recognition.

[13] Tetsuya Takiguchi, et al. Audio-Visual Speech Recognition Using Bimodal-Trained Bottleneck Features for a Person with Severe Hearing Loss, 2016, INTERSPEECH.

[14] H. McGurk, et al. Hearing lips and seeing voices, 1976, Nature.

[15] Thomas Brox, et al. Striving for Simplicity: The All Convolutional Net, 2014, ICLR.

[16] Timothy F. Cootes, et al. Extraction of Visual Features for Lipreading, 2002, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17] Jon Barker, et al. An audio-visual corpus for speech perception and automatic speech recognition, 2006, The Journal of the Acoustical Society of America.

[18] Daniel Jurafsky, et al. Lexicon-Free Conversational Speech Recognition with Neural Networks, 2015, NAACL.

[19] Sridha Sridharan, et al. Patch-Based Representation of Visual Speech, 2006.

[20] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Communications of the ACM.

[21] Petros Maragos, et al. Multimodal Fusion and Learning with Uncertain Features Applied to Audiovisual Speech Recognition, 2007, IEEE 9th Workshop on Multimedia Signal Processing.

[22] Joon Son Chung, et al. Out of Time: Automated Lip Sync in the Wild, 2016, ACCV Workshops.

[23] Amit Garg, et al. Lip reading using CNN and LSTM, 2016.

[24] Barry-John Theobald, et al. Comparison of human and machine-based lip-reading, 2009, AVSP.

[25] Hermann Ney, et al. Deep Learning of Mouth Shapes for Sign Language, 2015, IEEE International Conference on Computer Vision Workshop (ICCVW).

[26] Stefanos Zafeiriou, et al. 300 Faces in-the-Wild Challenge: The First Facial Landmark Localization Challenge, 2013, IEEE International Conference on Computer Vision Workshops.

[27] Navdeep Jaitly, et al. Towards End-To-End Speech Recognition with Recurrent Neural Networks, 2014, ICML.

[28] M. Woodward, et al. Phoneme perception in lipreading, 1960, Journal of Speech and Hearing Research.

[29] Jürgen Schmidhuber, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, 2006, ICML.

[30] R. D. Easton, et al. Perceptual dominance during lipreading, 1982, Perception & Psychophysics.

[31] Jean-Philippe Thiran, et al. Information Theoretic Feature Extraction for Audio-Visual Speech Recognition, 2009, IEEE Transactions on Signal Processing.

[32] Tara N. Sainath, et al. Fundamental technologies in modern speech recognition, 2012, IEEE Signal Processing Magazine, doi: 10.1109/MSP.2012.2205597.

[33] C. G. Fisher, et al. Confusions among visually perceived consonants, 1968, Journal of Speech and Hearing Research.

[34] Dong Yu, et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[35] Davis E. King, et al. Dlib-ml: A Machine Learning Toolkit, 2009, Journal of Machine Learning Research.

[36] Ming Yang, et al. 3D Convolutional Neural Networks for Human Action Recognition, 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37] Andrew Zisserman, et al. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, 2013, ICLR.

[38] Chong Wang, et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, 2015, ICML.

[39] Juhan Nam, et al. Multimodal Deep Learning, 2011, ICML.

[40] Matti Pietikäinen, et al. Lipreading with Local Spatiotemporal Descriptors, 2009, IEEE Transactions on Multimedia.

[41] A. Cruttenden. Gimson's Pronunciation of English, 1994.

[42] Michael Wand, et al. Lipreading with long short-term memory, 2016, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[43] Joon Son Chung, et al. Lip Reading in the Wild, 2016, ACCV.

[44] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[45] Tetsuya Ogata, et al. Lipreading using convolutional neural network, 2014, INTERSPEECH.

[46] Petros Maragos, et al. Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition, 2009, IEEE Transactions on Audio, Speech, and Language Processing.