Kernelization of Matrix Updates, When and How?

We define what it means for a learning algorithm to be kernelizable in the cases where the instances are vectors, asymmetric matrices, and symmetric matrices, respectively. We characterize kernelizability in terms of an invariance of the algorithm to certain orthogonal transformations. If we assume that the algorithm's action relies on a linear prediction, then we show that in each case the linear parameter vector must be a certain linear combination of the instances. We give a number of examples of how to apply our methods; in particular, we show how to kernelize multiplicative updates for symmetric instance matrices.
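To see why writing the parameter as a linear combination of the instances is the key to kernelization, the following minimal sketch spells out the standard kernel-trick identity for a linear prediction. The symbols w_t, a_i, phi, and k are generic placeholders used for illustration, not notation taken from the paper, and the sketch covers only the vector-instance case, not the matrix updates treated in the text.

```latex
% Minimal sketch (standard kernel-trick identity, assumed notation):
% if the parameter maintained after t trials is a linear combination of the
% feature-mapped instances,
%   w_t = \sum_{i=1}^{t} a_i \, \phi(x_i),
% then every prediction needs only kernel evaluations
% k(x_i, x) = \phi(x_i)^{\top} \phi(x):
\[
  \hat{y}
  = w_t^{\top} \phi(x)
  = \sum_{i=1}^{t} a_i \, \phi(x_i)^{\top} \phi(x)
  = \sum_{i=1}^{t} a_i \, k(x_i, x).
\]
% The algorithm can therefore be run in terms of the coefficients a_i and the
% kernel matrix alone, without ever forming \phi(x) explicitly.
```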
