On the computational complexity of approximating distributions by probabilistic automata

We introduce a rigorous performance criterion for training algorithms for probabilistic automata (PAs) and hidden Markov models (HMMs), used extensively for speech recognition, and analyze the complexity of the training problem as a computational problem. The PA training problem is the problem of approximating an arbitrary, unknown source distribution by distributions generated by a PA. We investigate the following question about this important, well-studied problem: Does there exist an efficient training algorithm such that the trained PAs provably converge to a model close to an optimal one with high confidence, after only a feasibly small set of training data? We model this problem in the framework of computational learning theory and analyze the sample as well as the computational complexity. We show that the number of examples required for training PAs is moderate: except for some logarithmic factors, it is linear in the number of transition probabilities to be trained and a low-degree polynomial in the example length and in the parameters quantifying the accuracy and confidence. Computationally, however, training PAs is quite demanding: PAs with a fixed number of states are trainable in time polynomial in the accuracy and confidence parameters and in the example length, but not in the alphabet size unless RP = NP. The latter result is shown via a strong non-approximability result for the single-string maximum-likelihood model problem for 2-state PAs, which is of independent interest.
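As an illustration of the objects involved (a sketch, not taken from the paper), the following Python snippet computes the probability that a small PA assigns to a string via the standard forward computation; this is the likelihood that the single-string maximum-likelihood model problem mentioned above asks to maximize over models. The matrix representation of the PA and all concrete numbers are assumptions chosen for the example.

```python
# Minimal sketch (illustrative only): the distribution a small PA defines over
# strings, computed by the standard forward recursion. All names and numbers
# below are assumptions for the example, not the paper's notation.

import numpy as np

def string_probability(init, trans, final, string):
    """Probability that the PA generates `string` and then halts.

    init:  length-n vector of initial-state probabilities
    trans: dict mapping each symbol a to an n x n matrix; trans[a][i, j] is the
           probability of emitting a while moving from state i to state j
    final: length-n vector of halting probabilities
    """
    alpha = np.asarray(init, dtype=float)   # forward vector over states
    for a in string:
        alpha = alpha @ trans[a]            # one emission/transition step
    return float(alpha @ final)             # halt after the last symbol

# A 2-state PA over a binary alphabet (the setting of the hardness result
# mentioned above); for each state, the emission probabilities plus the
# halting probability sum to 1.
init  = [1.0, 0.0]
final = [0.1, 0.2]
trans = {
    '0': np.array([[0.4, 0.1], [0.2, 0.3]]),
    '1': np.array([[0.3, 0.1], [0.1, 0.2]]),
}

print(string_probability(init, trans, final, '0110'))
```

Training a PA amounts to choosing the entries of `init`, `trans`, and `final` so that the induced distribution over strings is close to the unknown source; the hardness result says that even maximizing `string_probability` for a single given string is hard to approximate for 2-state PAs.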
