A comparison of new and old algorithms for a mixture estimation problem

We investigate the problem of estimating the proportion vector that maximizes the likelihood of a given sample for a mixture of given densities. We adapt a framework developed for supervised learning and give simple derivations for many of the standard iterative algorithms, such as gradient projection and EM. In this framework, the distance between the new and old proportion vectors is used as a penalty term. The squared distance leads to the gradient projection update, and the relative entropy to a new update which we call the exponentiated gradient update (EG_\eta). Curiously, when a second-order Taylor expansion of the relative entropy is used, we arrive at an update EM_\eta which, for \eta = 1, gives the usual EM update. Experimentally, both the EM_\eta update and the EG_\eta update for \eta > 1 outperform the EM algorithm and its variants. We also prove a polynomial bound on the rate of convergence of the EG_\eta algorithm.
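To make the updates described above concrete, the sketch below (not taken from the paper itself) implements one standard EM step and one EG_\eta step for the mixture-proportion problem, assuming the fixed component densities have already been evaluated at the sample points. The function names (em_update, eg_update) and the synthetic data are purely illustrative.

```python
import numpy as np

def em_update(w, P):
    """One EM step for the mixture proportions.

    w : (K,) current proportion vector (non-negative, sums to 1)
    P : (T, K) fixed component densities evaluated at the T sample
        points, P[t, i] = p_i(x_t)
    """
    mix = P @ w                                # mixture density at each sample point
    grad = (P / mix[:, None]).mean(axis=0)     # (1/T) sum_t p_i(x_t) / (w . p(x_t))
    return w * grad                            # stays normalized: components sum to 1

def eg_update(w, P, eta=1.0):
    """One exponentiated-gradient (EG_eta) step: scale each weight by
    exp(eta * gradient of the average log-likelihood), then renormalize."""
    mix = P @ w
    grad = (P / mix[:, None]).mean(axis=0)
    w_new = w * np.exp(eta * grad)
    return w_new / w_new.sum()

# Tiny usage example on synthetic density values (hypothetical data).
rng = np.random.default_rng(0)
P = rng.random((100, 3)) + 0.1                 # pretend densities of 3 components at 100 points
w = np.full(3, 1.0 / 3.0)
for _ in range(50):
    w = eg_update(w, P, eta=2.0)
print(w)
```

Under this reading, em_update corresponds to EM_\eta with \eta = 1, while eg_update with \eta > 1 is the multiplicative variant the abstract reports as outperforming EM in experiments.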
