Large-Scale Visual Speech Recognition

This work presents a scalable solution to open-vocabulary visual speech recognition. To achieve this, we constructed the largest existing visual speech recognition dataset, consisting of pairs of text and video clips of faces speaking (3,886 hours of video). In tandem, we designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words. The proposed system achieves a word error rate (WER) of 40.9% as measured on a held-out set. In comparison, professional lipreaders achieve 86.4% or 92.9% WER on the same dataset, depending on how much additional contextual information they are given. Our approach significantly improves on other lipreading approaches, including variants of LipNet and of Watch, Attend, and Spell (WAS), which achieve 89.8% and 76.8% WER respectively.
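The WER figures above follow the standard definition: the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis and the reference transcript, divided by the number of reference words. As a minimal illustration (not the paper's evaluation code), it can be computed with a small dynamic program:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Two of six reference words are dropped: WER = 2/6 ≈ 0.333
print(wer("the cat sat on the mat", "the cat sat mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why human lipreaders' scores above can approach or pass 90%.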
