Divergence-Based Motivation for Online EM and Combining Hidden Variable Models

Expectation-Maximization (EM) is a prominent approach for parameter estimation in hidden (aka latent) variable models. Given the full batch of data, EM forms an upper bound on the negative log-likelihood of the model at each iteration and updates to the minimizer of this upper bound. We first provide a "model level" interpretation of the EM upper bound as a sum of relative entropy divergences to a set of singleton models induced by the set of observations. This alternative motivation unifies the "observation level" and the "model level" views of EM. As a result, we formulate an online version of the EM algorithm by adding an analogous inertia term, which corresponds to the relative entropy divergence to the old model. Our motivation is more widely applicable than previous approaches and leads to simple online updates for mixtures of exponential distributions and hidden Markov models, as well as the first known online update for Kalman filters. Additionally, the finite-sample form of the inertia term lets us derive online updates even when no closed-form solution exists. Finally, we extend the analysis to the distributed setting, where we motivate a systematic way of combining multiple hidden variable models. Experimentally, we validate the results on synthetic as well as real-world datasets.
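To make the upper-bound view concrete, the following is a minimal sketch of the standard observation-level EM bound and, under an explicitly assumed form, of the kind of inertia-regularized online update described above; the symbols q, eta, and Delta are illustrative and not taken from the paper.

For observations $x_1,\dots,x_N$, hidden variables $h$, and current iterate $\theta_t$, the standard EM upper bound on the negative log-likelihood is

\[
-\sum_{n=1}^{N} \log p_\theta(x_n)
\;\le\;
\sum_{n=1}^{N} \Big( -\log p_\theta(x_n) + \mathrm{KL}\big(p_{\theta_t}(h \mid x_n)\,\big\|\,p_\theta(h \mid x_n)\big) \Big)
=
\sum_{n=1}^{N} \Big( -\mathbb{E}_{p_{\theta_t}(h \mid x_n)}\big[\log p_\theta(x_n, h)\big] - H\big(p_{\theta_t}(h \mid x_n)\big) \Big),
\]

with equality at $\theta = \theta_t$, so one EM iteration sets $\theta_{t+1}$ to the minimizer of the right-hand side over $\theta$. An online update of the kind sketched in the abstract would replace the sum over the batch by the term for the newly arrived observation $x_t$ and add an inertia term, e.g.

\[
\theta_{t+1} \in \arg\min_{\theta}\;
\frac{1}{\eta}\,\Delta(\theta_t, \theta)
\;-\; \mathbb{E}_{p_{\theta_t}(h \mid x_t)}\big[\log p_\theta(x_t, h)\big],
\]

where $\Delta$ denotes a relative entropy divergence between the old and new models and $\eta > 0$ is a learning rate; the particular choice of divergence and its finite-sample form are the paper's contribution and are not reproduced here.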
