Information geometry in optimization, machine learning and statistical inference

The present article gives an introduction to information geometry and surveys its applications in machine learning, optimization, and statistical inference. Information geometry is explained intuitively through divergence functions defined on a manifold of probability distributions and on other, more general manifolds. A divergence function induces a Riemannian structure together with a pair of dually coupled flatness criteria. Many manifolds arising in practice are dually flat. When a manifold is dually flat, a generalized Pythagorean theorem and a related projection theorem hold, providing useful tools for a variety of approximation and optimization problems. We apply them to alternating minimization problems, Ying-Yang machines, and the belief propagation algorithm in machine learning.
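
As a concrete illustration of the generalized Pythagorean theorem mentioned above, the following minimal Python sketch (a reader's aid, not part of the original paper) uses the Kullback-Leibler divergence on a finite joint distribution. The m-projection of a joint distribution p onto the e-flat independence (product) model is the product of its marginals p*, and for any product distribution q the decomposition D(p‖q) = D(p‖p*) + D(p*‖q) holds; the script checks this numerically. The specific distributions used here are arbitrary illustrative choices.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for strictly positive distributions."""
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    return float(np.sum(p * np.log(p / q)))

# Arbitrary joint distribution p(x, y) on a 2 x 3 alphabet (entries sum to 1).
p = np.array([[0.10, 0.25, 0.15],
              [0.20, 0.05, 0.25]])

# m-projection of p onto the e-flat independence model: product of its marginals.
p_star = np.outer(p.sum(axis=1), p.sum(axis=0))

# Any other point q of the independence model (an arbitrary product distribution).
q = np.outer([0.4, 0.6], [0.2, 0.3, 0.5])

lhs = kl(p, q)                       # D(p || q)
rhs = kl(p, p_star) + kl(p_star, q)  # D(p || p*) + D(p* || q)
print(f"D(p||q)             = {lhs:.10f}")
print(f"D(p||p*) + D(p*||q) = {rhs:.10f}")  # equal up to floating-point error
```

Because the independence model is an exponential family (hence e-flat) and p_star is the m-projection of p onto it, the two printed values agree exactly, which is the content of the projection theorem in this setting.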
