Additive versus exponentiated gradient updates for linear prediction

We consider two algorithms for on-line prediction based on a linear model. The algorithms are the well-known Gradient Descent (GD) algorithm and a new algorithm, which we call EG±. Both maintain a weight vector using simple updates. The GD algorithm updates the weight vector by subtracting from it the gradient of the squared error made on a prediction, multiplied by a parameter called the learning rate. The EG± algorithm uses the components of the gradient in the exponents of factors that multiplicatively update the weight vector. We present worst-case on-line loss bounds for EG± and compare them to previously known bounds for the GD algorithm. The bounds suggest that although the on-line losses of the two algorithms are in general incomparable, EG± incurs a much smaller loss when only a few of the input variables are relevant for the predictions. Experiments show that the worst-case upper bounds are quite tight even on simple artificial data. Our main methodological idea is to use a distance function between weight vectors both to motivate the algorithms and as a potential function in an amortized analysis that yields worst-case loss bounds. Using the squared Euclidean distance leads to the GD algorithm, and using the relative entropy leads to the EG± algorithm.
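
To make the contrast concrete, the following is a minimal sketch of the two update rules for a single on-line trial under the squared loss. The function names, the toy instance, and the choice to keep the exponentiated-gradient weights normalized on the probability simplex are illustrative assumptions for this sketch, not the paper's exact formulation (in particular, EG± maintains two such weight vectors to represent signed weights).

```python
import numpy as np

def squared_loss_gradient(w, x, y):
    """Gradient of the squared error (w.x - y)^2 with respect to w."""
    return 2.0 * (np.dot(w, x) - y) * x

def gd_update(w, x, y, eta):
    """Additive update: subtract the learning-rate-scaled gradient."""
    return w - eta * squared_loss_gradient(w, x, y)

def eg_update(w, x, y, eta):
    """Multiplicative update: scale each weight by an exponentiated
    gradient component, then renormalize so the weights sum to one."""
    factors = np.exp(-eta * squared_loss_gradient(w, x, y))
    w_new = w * factors
    return w_new / w_new.sum()

# One on-line trial on a toy instance (values are illustrative only).
x, y, eta = np.array([1.0, 0.0, 0.5]), 0.8, 0.1
w_gd = gd_update(np.zeros(3), x, y, eta)            # GD weights are unconstrained
w_eg = eg_update(np.full(3, 1.0 / 3.0), x, y, eta)  # EG weights stay a probability vector
```

Both rules can be read as approximately minimizing a trade-off of the form d(w_new, w_old) + eta * loss(w_new): taking d to be the squared Euclidean distance recovers the additive GD step, while taking d to be the relative entropy recovers the multiplicative EG step.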
