Dynamic pronunciation models for automatic speech recognition

As of this writing, the automatic recognition of spontaneous speech by computer is fraught with errors; many systems transcribe one out of every three to five words incorrectly, whereas humans can transcribe spontaneous speech with one error in twenty words or better. This high error rate is due in part to the poor modeling of pronunciations within spontaneous speech. This dissertation examines how pronunciations vary in this speaking style, and how speaking rate and word predictability can be used to predict when greater pronunciation variation can be expected. It includes an investigation of the relationship between speaking rate, word predictability, pronunciations, and errors made by speech recognition systems. The results of these studies suggest that for spontaneous speech, it may be appropriate to build models for syllables and words that can dynamically change the pronunciations used in the speech recognizer based on the extended context (including surrounding words, phones, speaking rate, etc.). Implementing new pronunciation models, automatically derived from data, within the ICSI speech recognition system yielded a 4–5% relative improvement on the Broadcast News recognition task. Roughly two thirds of these gains can be attributed to static baseform improvements; adding the ability to dynamically adjust pronunciations within the recognizer provides the other third of the improvement. The Broadcast News task also allows for comparison of performance on different styles of speech: the new pronunciation models do not help for pre-planned speech, but they provide a significant gain for spontaneous speech. Not only do the automatically learned pronunciation models capture some of the linguistic variation due to the speaking style, but they also represent variation in the acoustic model due to channel effects. The largest improvement was seen in the telephone speech condition, in which 12% of the errors produced by the baseline system were corrected.
