Sparse Approximation Through Boosting for Learning Large Scale Kernel Machines

Sparse approximation has recently become a preferred method for learning large-scale kernel machines. The technique represents the solution with only a subset of the original data points, known as basis vectors, which are usually chosen one at a time by a forward selection procedure driven by some selection criterion. The computational complexity of several resulting algorithms scales as O(NM²) in time and O(NM) in memory, where N is the number of training points and M is the number of basis vectors, which also equals the number of forward selection steps. On some large-scale data sets, obtaining a good solution requires many basis vectors, so M is far from negligible; yet limited computational resources (memory in particular) prevent us from including too many. To resolve this dilemma, we propose adding an ensemble of basis vectors, rather than a single one, at each forward step. The proposed method, closely related to gradient boosting, can decrease the required number of forward steps significantly, saving a large fraction of the computational cost. Numerical experiments on three large-scale regression tasks and a classification problem demonstrate the effectiveness of the approach.
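To make the idea concrete, below is a minimal sketch of a boosting-style forward selection loop for kernel least-squares regression. It is not the paper's algorithm: the RBF kernel, the squared-error loss, the correlation-based selection rule (in the spirit of kernel matching pursuit), and names such as `ensemble_forward_select` and `batch_size` are all our own assumptions. It illustrates only the abstract's central move: each step adds a batch of basis vectors rather than one, so far fewer steps are needed to reach a given basis size M.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Pairwise RBF kernel between the rows of X (N x d) and Z (M x d).
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def ensemble_forward_select(X, y, n_steps=10, batch_size=5, gamma=1.0, lam=1e-3):
    """Greedy forward selection that adds `batch_size` basis vectors per
    step instead of a single one, mimicking the ensemble idea above."""
    N = X.shape[0]
    selected = []            # indices of the chosen basis vectors
    residual = y.copy()      # residual of the current regularized LS fit
    alpha = None
    for _ in range(n_steps):
        # Score every unselected point by the absolute correlation of its
        # kernel column with the residual (a matching-pursuit criterion).
        candidates = np.setdiff1d(np.arange(N), np.array(selected, dtype=int))
        K_cand = rbf_kernel(X, X[candidates], gamma)          # N x |candidates|
        scores = np.abs(K_cand.T @ residual)
        # Add the top-scoring batch as an ensemble of new basis vectors.
        top = candidates[np.argsort(scores)[::-1][:batch_size]]
        selected.extend(top.tolist())
        # Refit regularized least squares on the enlarged basis, then
        # update the residual for the next round of selection.
        K_sel = rbf_kernel(X, X[selected], gamma)             # N x M
        M = len(selected)
        alpha = np.linalg.solve(K_sel.T @ K_sel + lam * np.eye(M), K_sel.T @ y)
        residual = y - K_sel @ alpha
    return np.array(selected), alpha

# Usage on a small synthetic regression problem:
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
idx, alpha = ensemble_forward_select(X, y, n_steps=8, batch_size=5)
print(f"{len(idx)} basis vectors selected")
```

For clarity the sketch refits the least-squares subproblem from scratch each step; an efficient implementation would instead update a Cholesky factor incrementally, which is how the forward schemes the abstract mentions attain O(NM²) total time. With P vectors added per step, only M/P such refits are needed.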
