Improving Speaker-Independent Lipreading with Domain-Adversarial Training

We present a lipreading system, i.e., a speech recognition system that uses only visual features, which employs domain-adversarial training to achieve speaker independence. Domain-adversarial training is integrated into the optimization of a lipreader built from a stack of feedforward and LSTM (Long Short-Term Memory) recurrent neural network layers, yielding an end-to-end trainable system that requires only a small amount of untranscribed target-speaker data to substantially improve recognition accuracy on that speaker. On pairs of distinct source and target speakers, we achieve a relative accuracy improvement of around 40% with only 15 to 20 seconds of untranscribed target speech. In multi-speaker training setups, the improvements are smaller but still substantial.
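The core mechanism behind domain-adversarial training is a gradient reversal layer (Ganin & Lempitsky): the forward pass is the identity, but during backpropagation the gradient flowing from a domain (here, speaker) classifier into the shared feature extractor is negated, so the features are driven to become speaker-invariant while remaining discriminative for the recognition task. The sketch below illustrates only this sign-flipping mechanism with manually computed gradients; all shapes, the scaling factor `lam`, and the single linear layers are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def grad_reversal_backward(upstream_grad, lam=1.0):
    # Forward pass of a gradient reversal layer is the identity;
    # the backward pass flips the sign and scales by lambda, so the
    # feature extractor *ascends* the domain loss, pushing speakers
    # to be indistinguishable in feature space.
    return -lam * upstream_grad

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))         # a batch of visual feature frames (illustrative)
W_feat = rng.normal(size=(8, 16))   # shared feature extractor (one linear layer)
W_dom = rng.normal(size=(16, 1))    # speaker/domain classifier head

feats = x @ W_feat                  # shared features
dom_logit = feats @ W_dom           # domain prediction

# Gradient of a (stand-in) domain loss w.r.t. the shared features,
# taking the upstream gradient at the logits to be all ones:
g_feats = np.ones_like(dom_logit) @ W_dom.T

# Passing through the reversal layer, the feature extractor receives
# the negated, scaled gradient instead:
g_reversed = grad_reversal_backward(g_feats, lam=0.5)
```

In an autograd framework the same effect is obtained by defining a custom operation whose backward pass returns `-lam * grad`; the recognition loss and the domain loss are then summed and minimized jointly end to end.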
