A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures

The problem of maximum-likelihood (ML) estimation of discrete tree-structured distributions is considered. Chow and Liu established that ML estimation reduces to the construction of a maximum-weight spanning tree using the empirical mutual information quantities as the edge weights. Using the theory of large deviations, we analyze the exponent associated with the error probability of the event that the ML estimate of the Markov tree structure differs from the true tree structure, given a set of independently drawn samples. By exploiting the fact that the output of ML estimation is a tree, we establish that the error exponent is equal to the exponential rate of decay of a single dominant crossover event. We prove that in this dominant crossover event, a non-neighbor node pair replaces a true edge that lies on the path in the true tree connecting the two nodes of that pair. Using ideas from Euclidean information theory, we then analyze ML estimation in the very noisy learning regime and show that the error exponent can be approximated by a ratio, which is interpreted as the signal-to-noise ratio (SNR) for learning tree distributions. We show via numerical experiments that our SNR approximation is accurate in this regime.
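
To make the Chow-Liu reduction referenced above concrete, the following is a minimal sketch (not taken from the paper): it estimates pairwise empirical mutual information from discrete samples and runs Kruskal's algorithm to obtain the maximum-weight spanning tree. The function names `empirical_mutual_information` and `chow_liu_tree`, and the use of NumPy, are illustrative assumptions only.

```python
# Sketch of the Chow-Liu procedure: ML tree-structure estimation reduces to a
# maximum-weight spanning tree with empirical mutual information edge weights.
import numpy as np
from itertools import combinations

def empirical_mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats from two discrete sample vectors."""
    n = len(x)
    joint, px, py = {}, {}, {}
    for a, b in zip(x, y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) p(b)) ), with counts c, px[a], py[b]
        mi += (c / n) * np.log(c * n / (px[a] * py[b]))
    return mi

def chow_liu_tree(samples):
    """samples: (n, d) array of discrete observations.
    Returns the edge set of the max-weight spanning tree under empirical MI,
    found with Kruskal's algorithm over a union-find on the d variables."""
    n, d = samples.shape
    weights = []
    for i, j in combinations(range(d), 2):
        w = empirical_mutual_information(samples[:, i], samples[:, j])
        weights.append((w, i, j))
    weights.sort(reverse=True)              # heaviest edges first
    parent = list(range(d))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path compression
            u = parent[u]
        return u
    edges = []
    for w, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:                        # edge (i, j) keeps the graph acyclic
            parent[ri] = rj
            edges.append((i, j))
        if len(edges) == d - 1:             # spanning tree complete
            break
    return edges
```

A structure-estimation error of the kind analyzed in the abstract occurs exactly when sampling noise in these empirical mutual information weights causes a non-edge to out-weigh a true edge on the corresponding path, so that the returned spanning tree differs from the true one.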
