Interpolating Between Gradient Descent and Exponentiated Gradient Using Reparameterized Gradient Descent

Continuous-time mirror descent (CMD) can be seen as the limit of the discrete-time mirror descent (MD) update as the step size becomes infinitesimally small. In this paper, we focus on the geometry of the primal and dual CMD updates and introduce a general framework for reparameterizing one CMD update as another. Specifically, the reparameterized update is itself a CMD update, applied to the composite loss with respect to the new variables, and the original variables are recovered via the reparameterization map. We use these results to introduce a new family of reparameterizations that interpolates between two commonly used updates, namely continuous-time gradient descent (GD) and the unnormalized exponentiated gradient update (EGU), while also extending to many other well-known updates. In particular, we show that for the underdetermined linear regression problem, these updates generalize the known behavior of GD and EGU and provably converge to the minimum $\mathrm{L}_{2-\tau}$-norm solution for $\tau\in[0,1]$. Our results also have implications for regularized training of neural networks to induce sparsity.
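As a concrete illustration, the sketch below discretizes a reparameterized gradient-descent update for the underdetermined least-squares problem. The coordinate-wise map $w = ((1-\tau/2)\,u)^{2/(2-\tau)}$ used here is an assumption (not stated in the abstract) chosen so that, in the continuous-time limit, it reduces to GD at $\tau=0$ and to EGU at $\tau=1$; the step size, initialization, and the helper name `reparam_gd` are likewise illustrative.

```python
import numpy as np

def reparam_gd(X, y, tau=0.5, lr=1e-3, steps=50_000, w0=1e-3):
    """Discretized reparameterized gradient descent for 0.5 * ||X w - y||^2.

    Assumed coordinate-wise map (illustrative, not taken from the abstract):
        w = ((1 - tau/2) * u) ** (2 / (2 - tau)),
    which reduces to plain GD at tau = 0 and to EGU (via w = (u/2)**2) at tau = 1.
    For 0 < tau < 1 the fractional power assumes u stays positive, so keep lr small.
    """
    d = X.shape[1]
    # Choose u so that the induced weights start at the small positive value w0.
    u = (w0 ** ((2 - tau) / 2)) / (1 - tau / 2) * np.ones(d)
    for _ in range(steps):
        w = ((1 - tau / 2) * u) ** (2 / (2 - tau))
        grad_w = X.T @ (X @ w - y)                                   # dL/dw for least squares
        grad_u = grad_w * ((1 - tau / 2) * u) ** (tau / (2 - tau))   # chain rule: dL/du = dL/dw * dw/du
        u = u - lr * grad_u
    return ((1 - tau / 2) * u) ** (2 / (2 - tau))

# Example: an underdetermined system (more columns than rows) with a sparse non-negative target.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 50))
w_star = np.zeros(50); w_star[:3] = 1.0
y = X @ w_star
w_hat = reparam_gd(X, y, tau=1.0)   # tau = 1 corresponds to the minimum L1-norm (sparse) regime
```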
