Near-Bayesian Support Vector Machines for imbalanced data classification with equal or unequal misclassification costs

Support Vector Machines (SVMs) are a popular family of classifiers originally developed for two-class classification problems. However, SVMs tend to perform poorly when the classes are imbalanced, particularly when the target class is under-represented. This paper proposes a Near-Bayesian Support Vector Machine (NBSVM) for such imbalanced classification problems, combining the philosophies of decision boundary shift and unequal regularization costs. Based on certain assumptions that hold for most real-world datasets, we use the fraction of the data represented by each class to achieve both the boundary shift and the asymmetric regularization costs. The proposed approach is extended to the multi-class scenario and also adapted for cases with unequal misclassification costs for the different classes. An extensive comparison with the standard SVM and some state-of-the-art methods is provided as evidence that the proposed approach performs competitively on imbalanced datasets. A modified Sequential Minimal Optimization (SMO) algorithm is also presented to solve the NBSVM optimization problem in a computationally efficient manner.
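To make the two ingredients concrete, the following is a minimal sketch of how class fractions might be turned into per-class regularization costs and how a decision threshold might be offset. It is not the NBSVM formulation: the inverse-frequency weighting rule is a common heuristic assumed here for illustration, the offset value is arbitrary, and the names `class_fractions`, `asymmetric_costs`, and `shifted_predict` are hypothetical.

```python
from collections import Counter


def class_fractions(labels):
    """Return the fraction of training points belonging to each class."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: count / n for cls, count in counts.items()}


def asymmetric_costs(labels, base_cost=1.0):
    """Per-class costs C_k = base_cost / (K * fraction_k).

    Assumption: this is the usual inverse-frequency heuristic, chosen only
    for illustration; the paper's own cost rule is not reproduced here.
    """
    fractions = class_fractions(labels)
    num_classes = len(fractions)
    return {cls: base_cost / (num_classes * f) for cls, f in fractions.items()}


def shifted_predict(score, offset=0.0):
    """Predict +1 when the raw SVM score exceeds the offset, else -1.

    A negative offset moves the boundary toward the negative (majority)
    class, so more points are assigned to the positive (minority) class.
    The offset is a hypothetical parameter, not the paper's shift.
    """
    return 1 if score > offset else -1


# Toy imbalanced label set: 90 negative points, 10 positive points.
labels = [-1] * 90 + [1] * 10
fractions = class_fractions(labels)
costs = asymmetric_costs(labels)
```

With these toy labels, the minority class receives a larger cost than the majority class, so its margin violations are penalized more heavily. A point with a slightly negative raw score is labelled negative under the default threshold and positive once the threshold is shifted below that score.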
