Path kernels and multiplicative updates

Kernels are typically applied to linear algorithms whose weight vector is a linear combination of the feature vectors of the examples. On-line versions of these algorithms are sometimes called "additive updates" because they add a multiple of the last feature vector to the current weight vector. In this paper we show how to use special convolution kernels to efficiently implement "multiplicative" updates. The kernels are defined by a directed graph. Each edge contributes an input. The inputs along a path form a product feature, and all such products together form the feature vector associated with the inputs. We also have a set of probabilities on the edges such that the outflow from each vertex is one. We then discuss multiplicative updates on these graphs, where the prediction is essentially a kernel computation and the update contributes a factor to each edge. After the factors are applied to the edges, the total outflow from each vertex is no longer one. However, some clever algorithms re-normalize the weights on the paths so that the total outflow from each vertex is one again. Finally, we show that if the digraph is built from a regular expression, then this structure can be used to speed up the kernel and re-normalization computations. We reformulate a large number of multiplicative update algorithms using path kernels and characterize the applicability of our method. The examples include efficient algorithms for learning disjunctions and a recent algorithm that predicts as well as the best pruning of a series-parallel digraph.
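
The computation described above can be illustrated with a minimal sketch. It assumes the digraph is acyclic, has a single source and a single sink, and has at most one edge between any pair of vertices. The function names (`path_sums`, `predict`, `update`) and the diamond-shaped example graph are hypothetical, and the re-normalization shown (dividing by backward path sums) is one standard way to restore unit outflow, not necessarily the algorithm used in the paper.

```python
from collections import defaultdict

def topological_order(edges, source):
    """Vertices reachable from source, in topological order (DFS-based)."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    order, seen = [], set()

    def visit(u):
        if u in seen:
            return
        seen.add(u)
        for v in succ[u]:
            visit(v)
        order.append(u)

    visit(source)
    return order[::-1]

def path_sums(edges, values, source, sink):
    """For each vertex u, sum over all u-to-sink paths of the product of values[e]."""
    total = {sink: 1.0}
    for u in reversed(topological_order(edges, source)):
        if u != sink:
            total[u] = sum(values[(a, b)] * total[b] for a, b in edges if a == u)
    return total

def predict(edges, weights, inputs, source, sink):
    """Path kernel: sum over source-to-sink paths of prod(weight_e * input_e)."""
    values = {e: weights[e] * inputs[e] for e in edges}
    return path_sums(edges, values, source, sink)[source]

def update(edges, weights, factors, source, sink):
    """Multiply each edge weight by its factor, then restore unit outflow."""
    scaled = {e: weights[e] * factors[e] for e in edges}
    z = path_sums(edges, scaled, source, sink)
    # Path weights keep their relative proportions under this rescaling.
    return {(u, v): scaled[(u, v)] * z[v] / z[u] for u, v in edges}

# Hypothetical diamond graph: s -> a -> t and s -> b -> t.
edges = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")]
weights = {("s", "a"): 0.5, ("s", "b"): 0.5, ("a", "t"): 1.0, ("b", "t"): 1.0}
inputs = {("s", "a"): 2.0, ("s", "b"): 3.0, ("a", "t"): 1.0, ("b", "t"): 1.0}
factors = {("s", "a"): 0.5, ("s", "b"): 1.0, ("a", "t"): 1.0, ("b", "t"): 1.0}

prediction = predict(edges, weights, inputs, "s", "t")
new_weights = update(edges, weights, factors, "s", "t")
```

In this sketch both the prediction and the re-normalization cost one pass over the edges per vertex, even though the number of source-to-sink paths can grow exponentially with the size of the graph.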
