Diversity analysis on imbalanced data sets by using ensemble models

Many real-world applications, such as medical diagnosis, fraud detection, and text classification, face the problem of learning from imbalanced data sets. The few minority-class instances provide insufficient information, so classification performance degrades greatly. Ensemble-based algorithms, a good way to improve the performance of weak learners, have been proposed to address the class imbalance problem. However, since diversity is one of the influential factors in ensemble learning, it is still not clear how diversity affects classification performance, especially on the minority classes. This paper explores the impact of diversity on the performance of each class and on overall performance. Accuracy, the other influential factor, is also discussed because of the trade-off between diversity and accuracy. First, three popular re-sampling methods (under-sampling, over-sampling, and SMOTE [13], a synthetic data generation algorithm) are combined into our ensemble model and evaluated for diversity analysis. Second, we experiment not only on two-class tasks but also on tasks with multiple classes. Third, we improve SMOTE in a novel way for handling multi-class data sets in an ensemble model, SMOTEBagging.
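
The abstract describes combining re-sampling (under-sampling, over-sampling, SMOTE) with a Bagging-style ensemble. The sketch below is a minimal illustration of that kind of pipeline, not the authors' SMOTEBagging algorithm: it pairs a SMOTE-style interpolation step with bagged decision trees for a two-class problem. All names (smote_oversample, SimpleSmoteBagging) and parameters (k_neighbors, n_estimators) are assumptions introduced here for illustration only.

```python
# Minimal sketch only: a SMOTE-style oversampler plus bagged decision trees.
# This is NOT the paper's SMOTEBagging algorithm; it just illustrates the kind
# of re-sampling + ensemble pipeline the abstract refers to.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier


def smote_oversample(X_min, n_synthetic, k_neighbors=5, rng=None):
    """Create synthetic minority samples by interpolating between a chosen
    minority instance and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    if n_synthetic <= 0 or len(X_min) < 2:
        return np.empty((0, X_min.shape[1]))
    k = min(k_neighbors + 1, len(X_min))        # +1: the nearest neighbour of a point is itself
    _, idx = NearestNeighbors(n_neighbors=k).fit(X_min).kneighbors(X_min)
    samples = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))            # random minority instance
        j = idx[i][rng.integers(1, k)]          # one of its neighbours (skip itself)
        samples.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(samples)


class SimpleSmoteBagging:
    """Bagging variant: each base tree sees a bootstrap of the majority class
    plus the minority class augmented with SMOTE-style synthetic samples."""

    def __init__(self, n_estimators=10, k_neighbors=5, random_state=0):
        self.n_estimators = n_estimators
        self.k_neighbors = k_neighbors
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        minority = classes[np.argmin(counts)]
        X_min = X[y == minority]
        X_maj, y_maj = X[y != minority], y[y != minority]
        self.estimators_ = []
        for _ in range(self.n_estimators):
            boot = rng.integers(len(X_maj), size=len(X_maj))   # bootstrap the majority class
            X_syn = smote_oversample(X_min, len(X_maj) - len(X_min),
                                     self.k_neighbors, rng)
            X_train = np.vstack([X_maj[boot], X_min, X_syn])
            y_train = np.concatenate([y_maj[boot],
                                      np.full(len(X_min) + len(X_syn), minority)])
            self.estimators_.append(DecisionTreeClassifier().fit(X_train, y_train))
        return self

    def predict(self, X):
        # Plain majority vote over the base trees.
        votes = np.stack([est.predict(X) for est in self.estimators_])
        majority_vote = []
        for column in votes.T:
            values, value_counts = np.unique(column, return_counts=True)
            majority_vote.append(values[np.argmax(value_counts)])
        return np.array(majority_vote)
```

In the paper's setting, the amount and type of re-sampling applied to each base learner is the lever for studying how diversity affects per-class and overall performance; the sketch above fixes a single rebalancing strategy for simplicity.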

[1]  Nitesh V. Chawla, et al. Exploiting Diversity in Ensembles: Improving the Performance on Unbalanced Datasets, 2007, MCS.

[2]  G. Yule, et al. On the association of attributes in statistics, with examples from the material of the Childhood Society, &c., 1900, Proceedings of the Royal Society of London.

[3]  Stan Matwin, et al. Addressing the Curse of Imbalanced Training Sets: One-Sided Selection, 1997, ICML.

[4]  Nitesh V. Chawla, et al. Editorial: special issue on learning from imbalanced data sets, 2004, SIGKDD Explorations.

[5]  Ludmila I. Kuncheva, et al. Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy, 2003, Machine Learning.

[6]  Nathalie Japkowicz, et al. The class imbalance problem: A systematic study, 2002, Intell. Data Anal.

[8]  Cen Li, et al. Classifying imbalanced data using a bagging ensemble variation (BEV), 2007, ACM-SE 45.

[9]  Anca Ralescu, et al. Issues in Mining Imbalanced Data Sets - A Review Paper, 2005.

[10]  Herna L. Viktor, et al. Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach, 2004, SIGKDD Explorations.

[11]  Gustavo E. A. P. A. Batista, et al. A study of the behavior of several methods for balancing machine learning training data, 2004, SIGKDD Explorations.

[12]  José Martínez Sotoca, et al. Combined Effects of Class Imbalance and Class Overlap on Instance-Based Classification, 2006, IDEAL.

[13]  Nitesh V. Chawla, et al. SMOTE: Synthetic Minority Over-sampling Technique, 2002, J. Artif. Intell. Res.

[14]  R. A. Mollineda, et al. The class imbalance problem in pattern classification and learning, 2009.

[15]  Yoav Freund, et al. Experiments with a New Boosting Algorithm, 1996, ICML.

[16]  Rosa Maria Valdovinos, et al. Class-dependant resampling for medical applications, 2005, Fourth International Conference on Machine Learning and Applications (ICMLA'05).

[17]  Hui Han, et al. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning, 2005, ICIC.

[18]  I. Tomek, et al. Two Modifications of CNN, 1976.

[19]  Leo Breiman, et al. Bagging Predictors, 1996, Machine Learning.

[21]  Peter Tiño, et al. Managing Diversity in Regression Ensembles, 2005, J. Mach. Learn. Res.

[22]  Yong Zhao, et al. All Zero Block Detection Based on Statistics for AVS-M Intra Frame Prediction, 2008, International Symposium on Intelligent Information Technology Application Workshops.