Dynamic Features for Visual Speechreading: A Systematic Comparison

Humans use visual as well as auditory speech signals to recognize spoken words, and a variety of automatic systems have been investigated for this task. The main purpose of this research was to systematically compare the performance of a range of dynamic visual features on a speechreading task. We found that normalizing images to eliminate variation due to translation, scale, and planar rotation yielded substantial improvements in generalization performance regardless of the visual representation used. In addition, the dynamic information in the difference between successive frames yielded better performance than optical-flow-based approaches, and compression by local low-pass filtering worked surprisingly better than global principal components analysis (PCA). These results are examined and possible explanations are explored.
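The two winning design choices in the abstract, frame differencing for dynamic information and local low-pass filtering for compression, can be illustrated with a minimal sketch. The paper does not specify its implementation details; the function names, the block size, and the use of block averaging as the low-pass filter are assumptions for illustration only.

```python
import numpy as np

def frame_differences(frames):
    """Dynamic features as differences between successive frames.

    frames: (T, H, W) grayscale sequence; returns (T-1, H, W) deltas.
    (Illustrative sketch; the paper's exact pipeline is not specified.)
    """
    return np.diff(frames.astype(np.float64), axis=0)

def lowpass_compress(image, block=4):
    """Compress by local low-pass filtering: average over block x block
    neighborhoods, then downsample (block size is an assumption)."""
    H, W = image.shape
    H2, W2 = H - H % block, W - W % block  # trim to a multiple of block
    trimmed = image[:H2, :W2]
    return trimmed.reshape(H2 // block, block, W2 // block, block).mean(axis=(1, 3))

# Demo on a synthetic 5-frame, 16x16 sequence.
rng = np.random.default_rng(0)
frames = rng.random((5, 16, 16))
deltas = frame_differences(frames)
features = np.stack([lowpass_compress(d, block=4) for d in deltas])
print(features.shape)  # (4, 4, 4)
```

Block averaging keeps only coarse local intensity structure, which is one plausible reading of why it generalizes better than global PCA: it discards high-frequency detail uniformly rather than fitting directions of variance in the training set.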
