Paper information - Sample Efficient Adaptive Text-to-Speech

We present a meta-learning approach for adaptive text-to-speech (TTS) with few data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of training is not to produce a neural network with fixed weights, which is then deployed as a TTS system. Instead, the aim is to produce a network that requires few data at deployment time to rapidly adapt to new speakers. We introduce and benchmark three strategies: (i) learning the speaker embedding while keeping the WaveNet core fixed, (ii) fine-tuning the entire architecture with stochastic gradient descent, and (iii) predicting the speaker embedding with a trained neural network encoder. The experiments show that these approaches are successful at adapting the multi-speaker neural network to new speakers, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
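
The difference between strategies (i) and (ii) can be illustrated with a minimal sketch. It is not the paper's implementation: the "core" is a hypothetical two-parameter linear model rather than a WaveNet, the data are synthetic, plain full-batch gradient descent replaces stochastic gradient descent, and all names (`predict`, `adapt`, `tune_core`) are invented for this example. Strategy (iii) is omitted because it depends on a separately trained encoder.

```python
# Toy stand-in for the shared core: NOT a WaveNet, just a linear model.
# prediction = w * x + dot(v, emb); (w, v) play the role of shared core weights,
# emb plays the role of the per-speaker embedding.

def predict(core, emb, x):
    return core["w"] * x + sum(v * e for v, e in zip(core["v"], emb))


def mse(core, emb, data):
    """Mean squared error of the model on (x, y) pairs."""
    return sum((predict(core, emb, x) - y) ** 2 for x, y in data) / len(data)


def adapt(core, emb, data, steps=500, lr=0.1, tune_core=False):
    """Adapt to a new speaker by full-batch gradient descent on the MSE.

    tune_core=False -> strategy (i): only the embedding is updated.
    tune_core=True  -> strategy (ii): embedding and core are both updated.
    """
    core = {"w": core["w"], "v": list(core["v"])}  # work on copies
    emb = list(emb)
    n = len(data)
    for _ in range(steps):
        residuals = [predict(core, emb, x) - y for x, y in data]
        mean_r = sum(residuals) / n
        # Gradient of the MSE with respect to each embedding component.
        grad_emb = [2 * mean_r * v for v in core["v"]]
        if tune_core:
            # Gradients with respect to the core weights w and v.
            grad_w = 2 * sum(r * x for r, (x, _) in zip(residuals, data)) / n
            grad_v = [2 * mean_r * e for e in emb]
            core["w"] -= lr * grad_w
            core["v"] = [v - lr * g for v, g in zip(core["v"], grad_v)]
        emb = [e - lr * g for e, g in zip(emb, grad_emb)]
    return core, emb


# Hypothetical "pretrained" core and a synthetic new speaker whose slope (1.5)
# differs from the core's slope (1.0), so the embedding alone cannot fit it.
pretrained_core = {"w": 1.0, "v": [0.5, -0.3]}
init_emb = [0.0, 0.0]
new_speaker = [(x, 1.5 * x + 0.8) for x in [-1.0, -0.5, 0.0, 0.5, 1.0]]

loss_init = mse(pretrained_core, init_emb, new_speaker)
core_i, emb_i = adapt(pretrained_core, init_emb, new_speaker, tune_core=False)
core_ii, emb_ii = adapt(pretrained_core, init_emb, new_speaker, tune_core=True)
loss_emb_only = mse(core_i, emb_i, new_speaker)
loss_full = mse(core_ii, emb_ii, new_speaker)

print("before adaptation:      ", loss_init)
print("(i) embedding only:     ", loss_emb_only)
print("(ii) full fine-tuning:  ", loss_full)
```

In this toy setup the embedding can only shift the output, so strategy (i) plateaus at a nonzero error while strategy (ii) also corrects the slope. That gap is a property of the synthetic example and says nothing about the relative quality of the strategies reported in the paper.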
