An approach for classification of highly imbalanced data using weighting and undersampling

Real-world datasets are often imbalanced, and learning from them remains a major challenge for machine learning. Common remedies include weighting, sub-sampling, and data modeling. Techniques such as support-vector machines perform very well on balanced data but may fail when applied to imbalanced datasets. In this paper, we propose a new undersampling technique for selecting instances from the majority class. We evaluated the approach on several real imbalanced biological datasets in which the ratio of negative to positive samples ranges from ~9:1 to ~100:1. A useful classifier should have both high sensitivity and high specificity. Our results demonstrate that the proposed selection technique improves sensitivity compared with a weighted support-vector machine and with results available in the literature for the same datasets.
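To make the terms in the abstract concrete, the following is a minimal Python sketch of the two generic ingredients it mentions, undersampling and class weighting, together with the sensitivity and specificity measures. It is not the selection technique proposed in the paper, which the abstract does not describe: the undersampling here is plain random selection, the weights follow a common inverse-frequency heuristic, and all function names and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def undersample_majority(X, y, ratio=1.0, seed=0):
    """Randomly keep about `ratio` negatives per positive.

    Assumption: plain random undersampling, NOT the paper's selection method.
    Labels are 1 (positive, minority) and 0 (negative, majority).
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [X[i] for i in kept], [y[i] for i in kept]


def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    return {c: n_samples / (len(counts) * k) for c, k in counts.items()}


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


# Toy data (hypothetical) with a 9:1 negative-to-positive ratio.
y = [1] * 10 + [0] * 90
X = [[float(i)] for i in range(100)]

X_bal, y_bal = undersample_majority(X, y, ratio=1.0)
weights = balanced_class_weights(y)
```

In this sketch, `weights` could be passed to a classifier that supports per-class misclassification costs (for example a weighted support-vector machine), while `X_bal, y_bal` would instead be used to train an unweighted classifier on the reduced set. Reporting sensitivity and specificity separately matters because, at a 100:1 ratio, a classifier that always predicts the negative class reaches about 99% accuracy with zero sensitivity.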
