Exponentiated Gradient Meets Gradient Descent

(Stochastic) gradient descent and the multiplicative update method are among the most popular algorithms in machine learning. We introduce and study a new regularization that unifies the additive and multiplicative updates. It is derived from a hyperbolic analogue of the entropy function, which we call hypentropy, and is motivated by a natural extension of the multiplicative update to negative numbers. The hypentropy has a natural spectral counterpart, which we use to derive a family of matrix-based updates that bridge gradient methods and the multiplicative method for matrices. While the latter is applicable only to positive semi-definite matrices, the spectral hypentropy method can naturally be used with general rectangular matrices. We analyze the new family of updates by deriving tight regret bounds. We empirically study the applicability of the new update in settings such as multiclass learning, where the parameters form a general rectangular matrix.
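As a rough sketch of the regularization in question (the exact form and normalization below are our assumption, since the abstract does not spell them out), the hypentropy of a weight vector w in R^d with scale parameter beta > 0 can be written as

\[
\phi_\beta(w) \;=\; \sum_{i=1}^{d} \Big( w_i \,\operatorname{arcsinh}\!\big(w_i/\beta\big) \;-\; \sqrt{w_i^2 + \beta^2} \Big),
\]

and the induced unconstrained mirror-descent step with gradient \(g_t\) and step size \(\eta\) then has the coordinate-wise closed form

\[
w_{t+1} \;=\; \beta \,\sinh\!\big(\operatorname{arcsinh}(w_t/\beta) \;-\; \eta\, g_t\big).
\]

Under this sketch, large \(\beta\) makes the step approach plain gradient descent, while small \(\beta\) recovers multiplicative (EG-style) behavior on positive and negative weights alike, which is the sense in which the regularization interpolates between the additive and multiplicative updates.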
