Improving the conversion of whispered speech recorded by a NAM sensor into audible speech

The NAM-to-speech conversion technique proposed by Toda and colleagues, which converts Non-Audible Murmur (NAM) to audible speech through a statistical mapping trained on aligned corpora, is very promising, but its performance is still insufficient. In this paper, we present our current work on improving the intelligibility and naturalness of speech synthesized from whispered speech with this technique. The first system improves F0 estimation and the voicing decision: a simple neural network detects voiced segments in the whisper, while a GMM, trained on voiced segments, estimates a continuous melodic (F0) contour. In the second system, we attempt to integrate visual information to improve spectral estimation, F0 estimation, and the voicing decision.
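The core of the Toda-style statistical mapping is a joint GMM fitted on aligned source/target frames, with conversion performed by the conditional expectation E[y | x]. The sketch below illustrates that mapping step on synthetic data; the data shapes, feature dimensions, and the use of scikit-learn/SciPy are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 2-D "whisper spectral features" x and a scalar
# "target F0" y linearly related to x (a toy aligned corpus, not real NAM data).
n = 2000
x = rng.normal(size=(n, 2))
y = 1.5 * x[:, :1] - 0.5 * x[:, 1:2] + 100.0 + 0.1 * rng.normal(size=(n, 1))

# Fit a joint GMM on z = [x, y], as in GMM-based voice conversion training.
z = np.hstack([x, y])
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      random_state=0).fit(z)

def gmm_conditional_mean(gmm, x_frames, dx):
    """E[y | x] under the joint GMM: the standard mapping function."""
    k_comp = gmm.n_components
    # Responsibilities p(k | x) from the x-marginal of each component.
    log_r = np.stack([
        np.log(gmm.weights_[k]) + multivariate_normal.logpdf(
            x_frames, gmm.means_[k, :dx], gmm.covariances_[k][:dx, :dx])
        for k in range(k_comp)
    ], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # Per-component conditional mean: mu_y + S_yx S_xx^{-1} (x - mu_x).
    y_hat = np.zeros((x_frames.shape[0], gmm.means_.shape[1] - dx))
    for k in range(k_comp):
        cov = gmm.covariances_[k]
        a = cov[dx:, :dx] @ np.linalg.inv(cov[:dx, :dx])
        mu_k = gmm.means_[k, dx:] + (x_frames - gmm.means_[k, :dx]) @ a.T
        y_hat += r[:, k:k + 1] * mu_k
    return y_hat

y_pred = gmm_conditional_mean(gmm, x, dx=2)
rmse = float(np.sqrt(np.mean((y_pred - y) ** 2)))  # small on this toy data
```

In the actual systems described here, the same conditional-expectation mapping would be restricted to frames the neural network has classified as voiced, yielding a continuous F0 contour only where voicing is detected.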