A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures

The problem of maximum-likelihood (ML) estimation of discrete tree-structured distributions is considered. Chow and Liu established that ML estimation reduces to the construction of a maximum-weight spanning tree using the empirical mutual information quantities as the edge weights. Using the theory of large deviations, we analyze the exponent associated with the error probability of the event that the ML estimate of the Markov tree structure differs from the true tree structure, given a set of independently drawn samples. By exploiting the fact that the output of ML estimation is a tree, we establish that the error exponent is equal to the exponential rate of decay of a single dominant crossover event. We prove that in this dominant crossover event, a non-neighbor node pair replaces a true edge that lies on the path in the true tree connecting the two nodes of that pair. Using ideas from Euclidean information theory, we then analyze ML estimation in the very noisy learning regime and show that the error exponent can be approximated by a ratio, which is interpreted as the signal-to-noise ratio (SNR) for learning tree distributions. We show via numerical experiments that our SNR approximation is accurate in this regime.
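
To make the Chow-Liu reduction referenced above concrete, the following is a minimal sketch (not taken from the paper): it estimates pairwise empirical mutual information from discrete samples and runs Kruskal's algorithm to obtain the maximum-weight spanning tree. The function names `empirical_mutual_information` and `chow_liu_tree`, and the use of NumPy, are illustrative assumptions only.

```python
# Sketch of the Chow-Liu procedure: ML tree-structure estimation reduces to a
# maximum-weight spanning tree with empirical mutual information edge weights.
import numpy as np
from itertools import combinations

def empirical_mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats from two discrete sample vectors."""
    n = len(x)
    joint, px, py = {}, {}, {}
    for a, b in zip(x, y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) p(b)) ), with counts c, px[a], py[b]
        mi += (c / n) * np.log(c * n / (px[a] * py[b]))
    return mi

def chow_liu_tree(samples):
    """samples: (n, d) array of discrete observations.
    Returns the edge set of the max-weight spanning tree under empirical MI,
    found with Kruskal's algorithm over a union-find on the d variables."""
    n, d = samples.shape
    weights = []
    for i, j in combinations(range(d), 2):
        w = empirical_mutual_information(samples[:, i], samples[:, j])
        weights.append((w, i, j))
    weights.sort(reverse=True)              # heaviest edges first
    parent = list(range(d))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path compression
            u = parent[u]
        return u
    edges = []
    for w, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:                        # edge (i, j) keeps the graph acyclic
            parent[ri] = rj
            edges.append((i, j))
        if len(edges) == d - 1:             # spanning tree complete
            break
    return edges
```

A structure-estimation error of the kind analyzed in the abstract occurs exactly when sampling noise in these empirical mutual information weights causes a non-edge to out-weigh a true edge on the corresponding path, so that the returned spanning tree differs from the true one.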
