Online learning with kernels

Kernel-based algorithms such as support vector machines have achieved considerable success in various problems in the batch setting, where all of the training data is available in advance. Support vector machines combine the so-called kernel trick with the large margin idea. There has been little use of these methods in an online setting suitable for real-time applications. In this paper, we consider online learning in a reproducing kernel Hilbert space. By applying classical stochastic gradient descent within a feature space, together with some straightforward tricks, we develop simple and computationally efficient algorithms for a wide range of problems such as classification, regression, and novelty detection. In addition to allowing the exploitation of the kernel trick in an online setting, we examine the value of large margins for classification in the online setting with a drifting target. We derive worst-case loss bounds and, moreover, show convergence of the hypothesis to the minimizer of the regularized risk functional. We present some experimental results that support the theory as well as illustrate the power of the new algorithms for online novelty detection.
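The abstract describes the approach only at a high level. As a rough illustration of stochastic gradient descent in an RKHS, the following minimal sketch shows one way an online kernel classifier along these lines might look; the Gaussian kernel, hinge loss, learning rate `eta`, and regularization parameter `lam` are illustrative assumptions, not the paper's exact algorithm or parameter choices.

```python
import numpy as np

def gaussian_kernel(x, x_prime, sigma=1.0):
    """RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - x_prime) ** 2) / (2.0 * sigma ** 2))

def online_kernel_sgd(stream, kernel=gaussian_kernel, eta=0.1, lam=0.01):
    """Online classification by stochastic gradient descent in an RKHS (sketch).

    The hypothesis is stored as a kernel expansion f(x) = sum_i alpha_i k(x_i, x).
    Each step shrinks the old coefficients by (1 - eta * lam), which implements
    the regularization term, and on a margin violation under the hinge loss it
    adds the current example to the expansion with weight eta * y.
    """
    support, alphas = [], []
    for x, y in stream:                                  # labels y in {-1, +1}
        f_x = sum(a * kernel(xi, x) for a, xi in zip(alphas, support))
        yield np.sign(f_x) if f_x != 0 else 1.0          # predict before updating
        alphas = [(1.0 - eta * lam) * a for a in alphas]  # regularization decay
        if y * f_x < 1.0:                                 # hinge-loss margin error
            support.append(np.asarray(x, dtype=float))
            alphas.append(eta * y)
```

Because every margin error adds one term to the kernel expansion, practical variants typically truncate or down-weight old terms to keep the per-step cost bounded; the sketch above omits that and simply lets the expansion grow.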
