An Empirical Comparison of Support Vector Machines Versus Nearest Neighbour Methods for Machine Learning Applications

Support vector machines (SVMs) are traditionally regarded as among the best classifiers for minimizing the empirical probability of misclassification, although they can be slow to train when the training datasets are large. Here SVMs are compared with the classic k-nearest-neighbour (k-NN) decision rule on seven large real-world datasets from the University of California, Irvine (UCI) Machine Learning Repository. To counterbalance the slowness of SVMs on large datasets, three simple and fast methods for reducing the size of the training data, and thus speeding up the SVMs, are incorporated. One is blind random sampling; the other two are new linear-time methods for guided random sampling, which we call Gaussian Condensing and Gaussian Smoothing. Despite the speedups obtained by incorporating Gaussian Condensing and Gaussian Smoothing, the results show that k-NN methods are superior to SVMs on most of the seven datasets, casting doubt on the general superiority of SVMs. Furthermore, blind random sampling works surprisingly well and is robust, suggesting that it is a worthwhile preprocessing step for either SVMs or k-NN.
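The blind random sampling baseline described above is straightforward: draw a random subset of the training data and run the classifier on that subset. A minimal sketch, using Python's standard library and a synthetic two-class dataset (not one of the UCI sets used in the paper), pairs this preprocessing step with a 1-NN decision rule; the function names and the 10% sampling fraction here are illustrative choices, not taken from the paper.

```python
import math
import random

def random_sample(train, fraction, seed=0):
    """Blind random sampling: keep a random fraction of the training set."""
    rng = random.Random(seed)
    k = max(1, int(len(train) * fraction))
    return rng.sample(train, k)

def nn_classify(train, x):
    """1-NN decision rule: return the label of the closest training point."""
    return min(train, key=lambda p: math.dist(p[0], x))[1]

# Synthetic two-class data: class 0 clustered near (0, 0), class 1 near (5, 5).
rng = random.Random(42)
train = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(200)] + \
        [((rng.gauss(5, 1), rng.gauss(5, 1)), 1) for _ in range(200)]

# Reduce the training set to 10% of its original size before classifying.
reduced = random_sample(train, fraction=0.1)
print(len(reduced))                       # 40
print(nn_classify(reduced, (0.2, -0.1)))  # 0
print(nn_classify(reduced, (4.8, 5.3)))   # 1
```

The same reduced set could be fed to an SVM instead of the 1-NN rule; the paper's point is that the sampling step is classifier-agnostic preprocessing.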
