The past, present, and future of speech processing

This article provides a succinct review of speech research, in particular its history, current trends, and prospects for the future. The research areas covered are speech analysis and synthesis, speech coding, speech enhancement, speech recognition, spoken language understanding, speaker identification and verification, and multimodal communication.
