Additive versus exponentiated gradient updates for linear prediction

We consider two algorithms for on-line prediction based on a linear model. The algorithms are the well-known Gradient Descent (GD) algorithm and a new algorithm, which we call EG±. Both maintain a weight vector using simple updates. The GD algorithm updates the weight vector by subtracting from it the gradient of the squared error made on a prediction, multiplied by a parameter called the learning rate. The EG± algorithm uses the components of the gradient in the exponents of factors that multiplicatively update the weight vector. We present worst-case on-line loss bounds for EG± and compare them to previously known bounds for the GD algorithm. The bounds suggest that although the on-line losses of the two algorithms are in general incomparable, EG± incurs a much smaller loss when only a few of the input variables are relevant for the predictions. Experiments show that the worst-case upper bounds are quite tight even on simple artificial data. Our main methodological idea is to use a distance function between weight vectors both to motivate the algorithms and as a potential function in an amortized analysis that yields worst-case loss bounds. Using the squared Euclidean distance leads to the GD algorithm, and using the relative entropy leads to the EG± algorithm.
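
To make the contrast concrete, the following is a minimal sketch of the two update rules for a single on-line trial under the squared loss. The function names, the toy instance, and the choice to keep the exponentiated-gradient weights normalized on the probability simplex are illustrative assumptions for this sketch, not the paper's exact formulation (in particular, EG± maintains two such weight vectors to represent signed weights).

```python
import numpy as np

def squared_loss_gradient(w, x, y):
    """Gradient of the squared error (w.x - y)^2 with respect to w."""
    return 2.0 * (np.dot(w, x) - y) * x

def gd_update(w, x, y, eta):
    """Additive update: subtract the learning-rate-scaled gradient."""
    return w - eta * squared_loss_gradient(w, x, y)

def eg_update(w, x, y, eta):
    """Multiplicative update: scale each weight by an exponentiated
    gradient component, then renormalize so the weights sum to one."""
    factors = np.exp(-eta * squared_loss_gradient(w, x, y))
    w_new = w * factors
    return w_new / w_new.sum()

# One on-line trial on a toy instance (values are illustrative only).
x, y, eta = np.array([1.0, 0.0, 0.5]), 0.8, 0.1
w_gd = gd_update(np.zeros(3), x, y, eta)            # GD weights are unconstrained
w_eg = eg_update(np.full(3, 1.0 / 3.0), x, y, eta)  # EG weights stay a probability vector
```

Both rules can be read as approximately minimizing a trade-off of the form d(w_new, w_old) + eta * loss(w_new): taking d to be the squared Euclidean distance recovers the additive GD step, while taking d to be the relative entropy recovers the multiplicative EG step.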
