Maximum likelihood learning of auditory feature maps for stationary vowels

A mathematical framework for learning acoustic features from a central auditory representation is presented. The authors adopt a statistical approach that models the learning process as maximum likelihood estimation of the signal distribution. An algorithm, called statistical matching pursuit (SMP), is introduced to identify the regions on the cortical surface where the features for each sound class are most prominent. The features are modeled with Gaussian mixture densities, and the expectation-maximization (EM) procedure is used both to improve the parameterization and to iteratively refine the selection of cortical regions from which the features are extracted. The learning algorithm is applied to vowel classification on the TIMIT database, where all the vowels (excluding diphthongs; nine in total) are treated as individual classes. Experimental results show that models trained with the SMP/EM algorithm achieve a recognition accuracy comparable to that of conventional recognizers.
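The Gaussian-mixture and EM ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' SMP/EM algorithm: the cortical region selection step is omitted entirely, and the diagonal covariances, the initialization, the variance floor, and all function names (`fit_gmm_em`, `gmm_loglik`, `classify`) are assumptions made for illustration. It fits one mixture per class by EM and assigns each feature vector to the class whose mixture gives the highest log-likelihood.

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    """Log density of each row of X under each diagonal Gaussian: shape (n, k)."""
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2 * np.pi * variances), axis=1))

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def fit_gmm_em(X, k, n_iter=50, seed=0, var_floor=1e-6):
    """Fit a k-component diagonal-covariance mixture to X (n, d) by EM.

    Initialization (random data points as means, global variance) and the
    variance floor are illustrative choices, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    means = X[rng.choice(n, size=k, replace=False)].copy()
    variances = np.tile(X.var(axis=0) + var_floor, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample.
        log_joint = np.log(weights) + log_gauss_diag(X, means, variances)
        resp = np.exp(log_joint - logsumexp(log_joint, axis=1)[:, None])
        # M-step: re-estimate weights, means, and variances from the
        # responsibility-weighted sufficient statistics.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        second_moment = (resp.T @ (X ** 2)) / nk[:, None]
        variances = np.maximum(second_moment - means ** 2, var_floor)
    return weights, means, variances

def gmm_loglik(X, params):
    """Per-sample log-likelihood under a fitted mixture."""
    weights, means, variances = params
    return logsumexp(np.log(weights) + log_gauss_diag(X, means, variances), axis=1)

def classify(X, class_models):
    """Assign each row of X to the class whose mixture scores it highest."""
    scores = np.stack([gmm_loglik(X, p) for p in class_models], axis=1)
    return np.argmax(scores, axis=1)
```

In this sketch each class would get its own call to `fit_gmm_em` on that class's training vectors, and `classify` would then be applied to held-out vectors. In the paper's setting the feature vectors would come from the selected cortical regions, which this example does not model.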
