Model-driven detection of clean speech patches in noise

Listeners may be able to recognise speech in adverse conditions by “glimpsing” time-frequency regions where the target speech is dominant. Previous computational attempts to identify such regions have been source-driven, relying on primitive cues. This paper describes a model-driven approach in which the likelihood that spectro-temporal patches of a noisy mixture represent speech is given by a generative model. The focus is on patch size and patch modelling. Small patches lead to a lack of discrimination, while large patches are more likely to contain contributions from other sources. A “cleanness” measure reveals that a good patch size is one which extends over a quarter of the speech frequency range and lasts for 40 ms. Gaussian mixture models are used to represent patches. A compact representation based on a 2D discrete cosine transform leads to reasonable speech/background discrimination.
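The patch-scoring idea can be sketched as follows. This is not the authors' code: it compresses a spectro-temporal patch (here assumed to span 16 frequency channels and 4 frames, roughly 40 ms at a 10 ms hop) with a 2D DCT, keeps the low-order coefficients, and scores the result under a Gaussian fitted to "clean speech" patches. For simplicity a single full-covariance Gaussian stands in for the paper's Gaussian mixture model, and the training data here is random noise rather than real speech; all dimensions and names are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn

def patch_features(patch, n_keep=3):
    """Compact 2D-DCT representation: keep the n_keep x n_keep low-order coefficients."""
    return dctn(patch, norm="ortho")[:n_keep, :n_keep].ravel()

def fit_gaussian(X):
    """Fit one Gaussian (mean, covariance) to a set of feature vectors."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularised
    return mu, cov

def log_likelihood(x, mu, cov):
    """Log N(x; mu, cov): the patch's score under the speech model."""
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
# Toy stand-in for clean-speech training patches: 16 channels x 4 frames each.
train = np.array([patch_features(rng.standard_normal((16, 4))) for _ in range(500)])
mu, cov = fit_gaussian(train)

# Score a patch taken from a noisy mixture; higher = more speech-like.
noisy_patch = rng.standard_normal((16, 4))
score = log_likelihood(patch_features(noisy_patch), mu, cov)
print(np.isfinite(score))  # True
```

In the full model, one such likelihood per patch position would be compared against a background model to decide which glimpses to keep; a mixture with several components replaces the single Gaussian to capture the variability of speech patches.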
