Speeding up Support Vector Machines: Probabilistic versus Nearest Neighbour Methods for Condensing Training Data

Several methods for reducing the running time of support vector machines (SVMs) are compared in terms of speed-up factor and classification accuracy, using seven large real-world datasets obtained from the UCI Machine Learning Repository. All the methods tested work by reducing the size of the training data that is then fed to the SVM. Two probabilistic methods that run in linear time with respect to the size of the training data are investigated: blind random sampling and a new method for guided random sampling (Gaussian Condensing). These methods are compared with k-Nearest Neighbour methods for reducing the size of the training set and for smoothing the decision boundary. For all the datasets tested, blind random sampling gave the best results for speeding up SVMs without significantly sacrificing classification accuracy.
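As an illustration of the simplest approach named above, blind random sampling can be sketched as drawing a uniform subset of the training pairs before any SVM is trained. This is a minimal sketch and not the authors' implementation: the function name `blind_random_sample`, the `fraction` and `seed` parameters, and the toy data are all assumptions made for the example, and the SVM training step itself is left out.

```python
import random


def blind_random_sample(features, labels, fraction, seed=0):
    """Return a uniformly random subset of the training pairs.

    "Blind" here means the draw ignores class labels and geometry.
    Hypothetical helper: name and parameters are illustrative assumptions.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    n = len(features)
    if n == 0:
        return [], []
    keep = max(1, round(n * fraction))  # always keep at least one point
    rng = random.Random(seed)           # seeded so the draw is repeatable
    chosen = rng.sample(range(n), keep) # sampling without replacement
    return [features[i] for i in chosen], [labels[i] for i in chosen]


# Toy data (made up for the example): 1000 two-dimensional points, two classes.
X = [[i, 2 * i] for i in range(1000)]
y = [i % 2 for i in range(1000)]

# Keep 10% of the data; the condensed set would then be passed to an SVM trainer.
X_small, y_small = blind_random_sample(X, y, fraction=0.1, seed=42)
print(len(X_small), len(y_small))
```

The condensed lists `X_small` and `y_small` stand in for the reduced training set that would be fed to the SVM. Because the selection never inspects the points, its cost does not depend on the dimensionality of the data, unlike the nearest-neighbour condensing methods it is compared against.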
