Connectionist speaker normalization and its applications to speech recognition

Speaker normalization may have a significant impact on both speaker-adaptive and speaker-independent speech recognition. In this paper, a codeword-dependent neural network (CDNN) is presented for speaker normalization. The network is used as a nonlinear mapping function to transform speech data between two speakers. The mapping function is characterized by two important properties. First, the assembly of mapping functions enhances overall mapping quality. Second, multiple input vectors are used simultaneously in the transformation. This not only makes full use of dynamic information but also alleviates possible errors in the supervision data. Large-vocabulary continuous speech recognition is chosen to study the effect of speaker normalization. Using speaker-dependent semi-continuous hidden Markov models, performance evaluation over 360 testing sentences from new speakers showed that speaker normalization significantly reduced the error rate from 41.9% to 5.0% when only 40 speaker-dependent sentences were used to estimate CDNN parameters.
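The two properties named in the abstract (an assembly of mapping functions selected per codeword, and several input vectors fed to the mapping at once) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's specification: the nearest-neighbour codeword choice, the one-hidden-layer tanh network, the context width, the edge-frame handling, and all names (`nearest_codeword`, `stack_context`, `CodewordDependentMapper`). The weights are random placeholders; estimating them from paired speaker data is omitted.

```python
import math
import random


def nearest_codeword(frame, codebook):
    """Return the index of the codebook vector closest to `frame` (squared Euclidean)."""
    dists = [sum((c - x) ** 2 for c, x in zip(code, frame)) for code in codebook]
    return dists.index(min(dists))


def stack_context(frames, t, half_width):
    """Concatenate frames t-half_width..t+half_width, repeating the edge frames."""
    last = len(frames) - 1
    stacked = []
    for i in range(t - half_width, t + half_width + 1):
        stacked.extend(frames[min(max(i, 0), last)])
    return stacked


def affine(weights, bias, x):
    """Compute weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class CodewordDependentMapper:
    """One small network per codeword (hypothetical layout, untrained weights)."""

    def __init__(self, codebook, half_width=1, hidden=4, seed=0):
        rng = random.Random(seed)
        dim = len(codebook[0])
        in_dim = dim * (2 * half_width + 1)  # several frames enter the net together
        self.codebook = codebook
        self.half_width = half_width

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # The "assembly": an independent (W1, b1, W2, b2) tuple for each codeword.
        self.nets = [
            (rand_matrix(hidden, in_dim), [0.0] * hidden,
             rand_matrix(dim, hidden), [0.0] * dim)
            for _ in codebook
        ]

    def map_frame(self, frames, t):
        """Map frame t using the network selected by its nearest codeword."""
        k = nearest_codeword(frames[t], self.codebook)
        w1, b1, w2, b2 = self.nets[k]
        x = stack_context(frames, t, self.half_width)
        h = [math.tanh(a) for a in affine(w1, b1, x)]
        return affine(w2, b2, h)

    def map_utterance(self, frames):
        """Apply the codeword-dependent mapping to every frame of an utterance."""
        return [self.map_frame(frames, t) for t in range(len(frames))]
```

In a sketch like this, each frame of the new speaker is routed to a network chosen by its quantized codeword, so each network only has to model the transformation within one region of the acoustic space, and the stacked context lets neighbouring frames contribute to every output vector.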
