Step Size Adaptation in Reproducing Kernel Hilbert Space

This paper presents an online support vector machine (SVM) that uses the stochastic meta-descent (SMD) algorithm to adapt its step size automatically. We formulate the online learning problem as stochastic gradient descent in a reproducing kernel Hilbert space (RKHS) and translate SMD to this nonparametric setting, where the gradient trace parameter is no longer a coefficient vector but an element of the RKHS. We derive efficient updates that allow us to perform the step size adaptation in linear time. We apply the online SVM framework to a variety of loss functions, and in particular show how to handle structured output spaces and achieve efficient online multiclass classification. Experiments show that our algorithm outperforms more primitive methods for setting the gradient step size.
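To make the mechanics concrete, the following is a minimal Python sketch of the idea, not the paper's algorithm: online hinge-loss gradient descent in an RKHS where a single scalar step size is adapted multiplicatively by an SMD-style rule. Everything here is illustrative: the names (OnlineKernelSMD, rbf) and all hyperparameter values are invented, the Hessian-vector term of the exact SMD trace update is dropped, the regularizer is omitted from the meta-update's inner product, and none of the linear-time bookkeeping derived in the paper is attempted. The point to notice is that the gradient trace v, like the hypothesis f, is stored as expansion coefficients over the observed examples: in the nonparametric setting it is an element of the RKHS rather than a coefficient vector.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

class OnlineKernelSMD:
    """Online hinge-loss SGD in an RKHS with an SMD-adapted scalar step size.

    The hypothesis f(x) = sum_i alpha_i k(x_i, x) and the SMD gradient
    trace v are both kept as coefficient vectors over the stored examples,
    since in the nonparametric setting v is itself an RKHS element.
    """

    def __init__(self, kernel=rbf, eta=0.1, reg=0.01, mu=0.1, lam=0.95):
        self.kernel = kernel
        self.eta = eta    # step size, adapted online by SMD
        self.reg = reg    # regularization constant
        self.mu = mu      # SMD meta-step-size (hypothetical value)
        self.lam = lam    # SMD trace decay factor (hypothetical value)
        self.X, self.alpha, self.v = [], [], []

    def predict(self, x):
        return sum(a * self.kernel(xi, x) for a, xi in zip(self.alpha, self.X))

    def _trace(self, x):
        # Evaluate the gradient trace v, itself an RKHS element, at x.
        return sum(c * self.kernel(xi, x) for c, xi in zip(self.v, self.X))

    def step(self, x, y):
        # Hinge loss l(f(x), y) = max(0, 1 - y f(x)); g is dl/df(x).
        g = -y if y * self.predict(x) < 1.0 else 0.0

        # SMD meta-update: the RKHS inner product <grad, v> collapses to
        # g * v(x) because the loss gradient is g * k(x, .). Correlated
        # successive gradients make g * v(x) negative, so eta grows.
        self.eta *= max(0.5, 1.0 - self.mu * g * self._trace(x))

        # Shrink f by the regularizer's gradient and decay the trace
        # (the Hessian term of the exact trace update is dropped here).
        shrink = 1.0 - self.eta * self.reg
        self.alpha = [shrink * a for a in self.alpha]
        self.v = [self.lam * c for c in self.v]

        if g != 0.0:  # margin violated: add a new support vector
            self.X.append(np.asarray(x, dtype=float))
            self.alpha.append(-self.eta * g)
            self.v.append(-self.eta * g)

# Toy usage: two Gaussian blobs with labels +1 / -1.
rng = np.random.default_rng(0)
clf = OnlineKernelSMD()
for _ in range(200):
    y = float(rng.choice([-1.0, 1.0]))
    x = rng.normal(loc=2.0 * y, scale=1.0, size=2)
    clf.step(x, y)
print("adapted eta:", clf.eta)
print("prediction at (2, 2):", np.sign(clf.predict([2.0, 2.0])))
```

Note that in this naive form the support set grows with every margin violation; the linear-time updates the abstract refers to, and any budget or truncation scheme for bounding the expansion, are beyond this sketch.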
