DIVERGENCE FUNCTION, INFORMATION MONOTONICITY AND INFORMATION GEOMETRY

A divergence function measures how different two points are in a base space. Well-known examples are the Kullback-Leibler divergence and the f-divergences, which are defined on a manifold of probability distributions. The Bregman divergence applies in a more general setting. This paper characterizes the geometrical structure that a divergence function induces, and proves that the f-divergences are unique in the sense of information invariance, giving the alpha-geometrical structure. Bregman divergences are characterized by a dually flat geometrical structure. The paper also studies geometrical properties of hierarchical models that include singular structure.
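The three families named above can be made concrete with a small numerical sketch. It is illustrative only and not taken from the paper: it assumes discrete distributions with strictly positive entries, uses the convention D_f(p || q) = sum_i p_i f(q_i / p_i) (other references swap the roles of p and q), and all function names are hypothetical. It checks that the Kullback-Leibler divergence is both an f-divergence (with f(u) = -log u) and a Bregman divergence (generated by negative entropy), and that merging outcomes does not increase it, which is the monotonicity property the title refers to.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive p, q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def f_divergence(p, q, f):
    """f-divergence under the assumed convention D_f = sum_i p_i * f(q_i / p_i)."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

def bregman_divergence(p, q, phi, grad_phi):
    """Bregman divergence phi(p) - phi(q) - <grad phi(q), p - q> for convex phi."""
    g = grad_phi(q)
    inner = sum(gi * (pi - qi) for gi, pi, qi in zip(g, p, q))
    return phi(p) - phi(q) - inner

def neg_entropy(p):
    """Convex generator phi(p) = sum_i p_i log p_i."""
    return sum(pi * math.log(pi) for pi in p)

def grad_neg_entropy(p):
    """Gradient of the negative entropy: log p_i + 1."""
    return [math.log(pi) + 1.0 for pi in p]

def coarse_grain(p, groups):
    """Merge outcomes: each group of indices becomes one outcome."""
    return [sum(p[i] for i in g) for g in groups]

p = [0.5, 0.3, 0.2]
q = [0.2, 0.1, 0.7]

kl = kl_divergence(p, q)
kl_as_f = f_divergence(p, q, lambda u: -math.log(u))
kl_as_bregman = bregman_divergence(p, q, neg_entropy, grad_neg_entropy)

# Merge the last two outcomes; the divergence should not increase.
groups = [[0], [1, 2]]
kl_coarse = kl_divergence(coarse_grain(p, groups), coarse_grain(q, groups))

print(kl, kl_as_f, kl_as_bregman, kl_coarse)
```

The Bregman identity relies on p and q both summing to one, so the linear term from the gradient cancels; for unnormalized positive measures the same generator yields an extra sum_i (q_i - p_i) term.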
