Paper Information - Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch

We have implemented a real time front end for detecting voiced speech and estimating its fundamental frequency. The front end performs the signal processing for voice-driven agents that attend to the pitch contours of human speech and provide continuous audiovisual feedback. The algorithm we use for pitch tracking has several distinguishing features: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range, in real time, without the need for postprocessing to produce smooth contours. The algorithm is based on two simple ideas in neural computation: the introduction of a purposeful nonlinearity, and the error signal of a least squares fit. The pitch tracker is used in two real time multimedia applications: a voice-to-MIDI player that synthesizes electronic music from vocalized melodies, and an audiovisual Karaoke machine with multimodal feedback. Both applications run on a laptop and display the user's pitch scrolling across the screen as he or she sings into the computer.
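The abstract names the error signal of a least squares fit as one of the two ideas behind the tracker, but gives no formulas. The following is a minimal sketch of one way such a fit can yield a frequency estimate without FFTs or peak picking; it is not the authors' implementation. It assumes the signal is locally close to a sinusoid, which satisfies x[n-1] + x[n+1] = 2 cos(ω) x[n], and fits cos(ω) by least squares. The function names `estimate_frequency` and `freq_to_midi`, the use of the relative residual as a confidence measure, and the whole-buffer loop are illustrative assumptions. The purposeful nonlinearity and the voicing detector mentioned in the abstract are not modeled here.

```python
import math


def estimate_frequency(samples, sample_rate):
    """Fit x[n-1] + x[n+1] = 2*cos(w)*x[n] by least squares.

    Returns (frequency_hz, relative_error). A relative error near 0 means
    the samples are well described by a single sinusoid. This is an
    illustrative sketch, not the method from the paper.
    """
    sxx = 0.0  # running sum of x[n]^2
    sxy = 0.0  # running sum of x[n] * (x[n-1] + x[n+1])
    syy = 0.0  # running sum of (x[n-1] + x[n+1])^2
    # Each sum is updated one sample at a time, so the same arithmetic
    # could run incrementally on a live stream.
    for n in range(1, len(samples) - 1):
        x = samples[n]
        y = samples[n - 1] + samples[n + 1]
        sxx += x * x
        sxy += x * y
        syy += y * y
    if sxx == 0.0 or syy == 0.0:
        return None, 1.0  # silence: no estimate, maximal error

    a = sxy / sxx                       # least squares estimate of 2*cos(w)
    c = max(-1.0, min(1.0, a / 2.0))    # clamp so acos is defined
    omega = math.acos(c)                # radians per sample
    a = 2.0 * c
    residual = syy - 2.0 * a * sxy + a * a * sxx  # sum of (y - a*x)^2
    relative_error = max(0.0, residual) / syy
    return omega * sample_rate / (2.0 * math.pi), relative_error


def freq_to_midi(freq_hz):
    """Nearest MIDI note number, using the standard A4 = 440 Hz = note 69."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


if __name__ == "__main__":
    rate = 8000.0
    tone = [math.sin(2 * math.pi * 220.0 * n / rate) for n in range(400)]
    freq, err = estimate_frequency(tone, rate)
    print(freq, err, freq_to_midi(freq))
```

For a clean 220 Hz tone the fit recovers the frequency almost exactly, and the note mapping gives MIDI note 57 (A3). On real speech, a fit of this kind is only meaningful where the signal is dominated by one sinusoid, which is presumably why the abstract pairs it with a nonlinearity and a voicing decision.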
