Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection

We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated by the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.
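For concreteness, the following is a minimal numerical sketch of a MEG step for the square loss setting described above, assuming the prediction ŷ = tr(W x xᵀ) = xᵀW x: the gradient step is taken in the matrix-logarithm domain, re-exponentiated, and renormalized to trace one, so the parameter stays symmetric positive definite. The function name, step size, and the eigendecomposition-based matrix log/exp are illustrative choices, not the authors' code.

```python
import numpy as np

def meg_update(W, x, y, eta):
    """One MEG step for the square loss. W is a symmetric positive
    definite matrix of trace one, x an instance vector, y a target."""
    # Prediction under the current parameter: y_hat = tr(W x x^T) = x^T W x.
    y_hat = x @ W @ x
    # Gradient of the square loss (y_hat - y)^2 with respect to W.
    grad = 2.0 * (y_hat - y) * np.outer(x, x)
    # Matrix log/exp via eigendecomposition; valid here because
    # every matrix involved is symmetric.
    vals, vecs = np.linalg.eigh(W)
    log_W = vecs @ np.diag(np.log(vals)) @ vecs.T
    vals, vecs = np.linalg.eigh(log_W - eta * grad)
    W_new = vecs @ np.diag(np.exp(vals)) @ vecs.T
    # The matrix exponential is positive definite by construction;
    # dividing by the trace restores the trace-one constraint.
    return W_new / np.trace(W_new)

# Hypothetical usage: start at the maximum-entropy parameter I/d
# and run the update on random instances.
rng = np.random.default_rng(0)
d = 4
W = np.eye(d) / d
for _ in range(100):
    x = rng.standard_normal(d)
    W = meg_update(W, x, y=x @ x / d, eta=0.1)
```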
