Binary coding of speech spectrograms using a deep auto-encoder

This paper reports our recent exploration of the layer-by-layer learning strategy for training a multi-layer generative model of patches of speech spectrograms. The top layer of the generative model learns binary codes that can be used for efficient compression of speech and could also be used for scalable speech recognition or rapid speech content retrieval. Each layer of the generative model is fully connected to the layer below, and the weights on these connections are pretrained efficiently using the contrastive divergence approximation to the log-likelihood gradient. After layer-by-layer pre-training, we "unroll" the generative model to form a deep auto-encoder, whose parameters are then fine-tuned using back-propagation. To reconstruct the full-length speech spectrogram, individual spectrogram segments predicted by their respective binary codes are combined using an overlap-and-add method. Experimental results on speech spectrogram coding demonstrate that the binary codes produce a log-spectral distortion that is approximately 2 dB lower than a subband vector quantization technique over the entire frequency range of wide-band speech.

Index Terms: deep learning, speech feature extraction, neural networks, auto-encoder, binary codes, Boltzmann machine
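The pre-training step described above can be illustrated with a minimal sketch of one contrastive-divergence (CD-1) update for a single binary-binary RBM layer. This is not the paper's implementation: the layer sizes, learning rate, and the binarized toy "patches" below are illustrative assumptions, and a real system would train on Gaussian-visible spectrogram data and stack several such layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: a 120-dim spectrogram patch coded by 50 binary units.
n_vis, n_hid = 120, 50
p = {
    "W": 0.01 * rng.standard_normal((n_vis, n_hid)),  # connection weights
    "bv": np.zeros(n_vis),                            # visible biases
    "bh": np.zeros(n_hid),                            # hidden biases
}

def cd1_update(v0, p, lr=0.1):
    """One CD-1 step for a binary-binary RBM; returns reconstruction error."""
    # Positive phase: hidden probabilities and a binary sample given the data.
    h0_prob = sigmoid(v0 @ p["W"] + p["bh"])
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: one reconstruction step (mean-field visible units).
    v1 = sigmoid(h0 @ p["W"].T + p["bv"])
    h1_prob = sigmoid(v1 @ p["W"] + p["bh"])
    # Contrastive-divergence approximation to the log-likelihood gradient.
    n = v0.shape[0]
    p["W"] += lr * (v0.T @ h0_prob - v1.T @ h1_prob) / n
    p["bv"] += lr * (v0 - v1).mean(axis=0)
    p["bh"] += lr * (h0_prob - h1_prob).mean(axis=0)
    return float(((v0 - v1) ** 2).mean())

# Toy batch of binarized "patches" standing in for spectrogram segments.
data = (rng.random((32, n_vis)) < 0.3).astype(float)
errs = [cd1_update(data, p) for _ in range(50)]
```

After greedy pre-training, the stacked RBMs would be unrolled into an encoder-decoder network and fine-tuned end to end with back-propagation, as the abstract describes.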
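The overlap-and-add reconstruction step can also be sketched. The snippet below recombines decoded fixed-length segments along one frequency track; it is a simplified one-dimensional illustration with an assumed Hann window and hop size, not the paper's exact procedure, which operates on full spectrogram segments.

```python
import numpy as np

def overlap_add(segments, hop):
    """Recombine fixed-length decoded segments by windowed overlap-and-add,
    normalizing each sample by the accumulated window weight."""
    seg_len = segments.shape[1]
    n = hop * (len(segments) - 1) + seg_len
    out = np.zeros(n)
    norm = np.zeros(n)
    win = np.hanning(seg_len)  # assumed analysis/synthesis window
    for i, seg in enumerate(segments):
        start = i * hop
        out[start:start + seg_len] += win * seg
        norm[start:start + seg_len] += win
    norm[norm == 0] = 1.0  # avoid division by zero at the edges
    return out / norm

# Cut a constant test signal into 50%-overlapping segments and rebuild it.
sig = np.ones(100)
seg_len, hop = 20, 10
segments = np.stack([sig[i * hop:i * hop + seg_len] for i in range(9)])
recon = overlap_add(segments, hop)
```

Dividing by the accumulated window weight makes the reconstruction exact wherever at least one window is nonzero, so small per-segment decoding errors are smoothly blended rather than introducing seams at segment boundaries.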
