Winnowing with Gradient Descent

The performance of multiplicative updates is typically logarithmic in the number of features when the targets are sparse. Strikingly, we show that the same property can also be achieved with gradient descent (GD) updates. We obtain this result by rewriting the non-negative weights w_i of multiplicative updates as ρ_i^2 and then performing a gradient descent step w.r.t. the new ρ_i parameters. We apply this method to the Winnow update, the Hedge update, and the unnormalized and normalized exponentiated gradient (EG) updates for linear regression. When the original weights w_i are scaled to sum to one (as done for Hedge and normalized EG), then in the corresponding reparameterized update, the ρ_i parameters are now divided by ‖ρ‖_2 after the gradient descent step. We show that these reparameterizations closely track the original multiplicative updates by proving in each case the same online regret bounds (albeit, in some cases, with slightly different constants). As an aside, our work exhibits a simple two-layer linear neural network that, when trained with gradient descent, can solve a certain sparse linear problem (known as the Hadamard problem) with exponentially fewer examples than any kernel method.
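
The sketch below illustrates the reparameterization described above, assuming plain NumPy and a hypothetical function name, step size, and toy regression data: the weights are stored as w_i = ρ_i^2, the gradient step is taken w.r.t. ρ via the chain rule, and in the normalized (Hedge / normalized-EG) case ρ is divided by its 2-norm after the step. It is intended only as an illustration of the idea, not as the paper's exact updates or regret analysis.

```python
import numpy as np

def reparameterized_gd_step(rho, grad_w, lr, normalize=False):
    """One GD step on the rho parameters, where the weights are w_i = rho_i**2.

    By the chain rule, dL/drho_i = 2 * rho_i * dL/dw_i, so the step on rho
    uses the gradient of the loss taken w.r.t. the weights w.
    If `normalize` is True, rho is divided by its 2-norm after the step,
    so the resulting weights w_i = rho_i**2 sum to one.
    """
    rho = rho - lr * 2.0 * rho * grad_w      # gradient descent step on rho
    if normalize:
        rho = rho / np.linalg.norm(rho)      # makes sum_i rho_i**2 = 1
    return rho

# Hypothetical usage on a single linear-regression example (x, y):
rho = np.ones(8) / np.sqrt(8)                # uniform start, weights sum to one
x, y = np.random.randn(8), 0.3
w = rho**2                                   # current weights
grad_w = 2.0 * (w @ x - y) * x               # gradient of squared loss w.r.t. w
rho = reparameterized_gd_step(rho, grad_w, lr=0.05, normalize=True)
```

For small step sizes, w_i ← ρ_i^2 (1 − 2·lr·∇_i)^2 ≈ w_i·exp(−4·lr·∇_i), which is how such a GD step on ρ tracks a multiplicative (EG-style) update on w.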
