Reparameterizing Mirror Descent as Gradient Descent

Most of the recent successful applications of neural networks have been based on training with gradient descent updates. However, for some small networks, other mirror descent updates provably learn more efficiently when the target is sparse. We present a general framework for casting a mirror descent update as a gradient descent update on a different set of parameters. In some cases, the mirror descent reparameterization can be described as training a modified network with standard backpropagation. The reparameterization framework is versatile and covers a wide range of mirror descent updates, including cases where the domain is constrained. Our reparameterization construction is carried out for the continuous-time versions of the updates. Finding general criteria under which the discrete versions closely track their continuous counterparts remains an interesting open problem.
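
To make the reparameterization idea concrete, here is a minimal numerical sketch of one well-known special case (an illustration, not the paper's general construction): the continuous-time unnormalized exponentiated-gradient update on positive weights w, given coordinatewise by d(log w_i)/dt = -dL/dw_i, coincides with plain gradient flow on auxiliary parameters u under the reparameterization w_i = u_i^2/4, because the chain rule gives dw_i/dt = (u_i/2) du_i/dt = -(u_i^2/4) dL/dw_i = -w_i dL/dw_i. The Python script below integrates both updates with small Euler steps on a toy sparse regression problem and checks that the two weight trajectories stay close; the data, loss, step size, and iteration count are arbitrary choices made only for illustration.

import numpy as np

# Toy sparse regression problem: y = X @ w_star with a sparse, nonnegative w_star
# (the exponentiated-gradient update keeps weights positive, so the target is
# chosen nonnegative for this illustration).
rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:3] = [2.0, 1.0, 0.5]
y = X @ w_star

def grad_loss(w):
    # Gradient of the squared loss L(w) = ||X w - y||^2 / (2 n).
    return X.T @ (X @ w - y) / n

eta, steps = 0.05, 20000          # small Euler step size and iteration count

# (1) Discretized continuous EGU update: d(log w)/dt = -grad L(w).
w_egu = np.full(d, 0.1)
# (2) Gradient descent on reparameterized weights u, with w = u**2 / 4;
#     initialized so that u**2 / 4 matches the EGU starting point.
u = 2.0 * np.sqrt(np.full(d, 0.1))

for _ in range(steps):
    w_egu = w_egu * np.exp(-eta * grad_loss(w_egu))
    u = u - eta * (u / 2.0) * grad_loss(u ** 2 / 4.0)   # chain rule through w = u^2/4

w_gd = u ** 2 / 4.0
print("max |w_EGU - w_GD|:", np.max(np.abs(w_egu - w_gd)))
print("recovered weights (rounded):", np.round(w_gd, 3))

With a small enough step size the two discrete trajectories remain numerically close, consistent with the continuous-time equivalence; as noted above, general conditions under which the discrete updates track their continuous counterparts are left open.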
