MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning

Imbalanced learning problems contain an unequal distribution of data samples among different classes and pose a challenge to any classifier as it becomes hard to learn the minority class samples. Synthetic oversampling methods address this problem by generating the synthetic minority class samples to balance the distribution between the samples of the majority and minority classes. This paper identifies that most of the existing oversampling methods may generate the wrong synthetic minority samples in some scenarios and make learning tasks harder. To this end, a new method, called Majority Weighted Minority Oversampling TEchnique (MWMOTE), is presented for efficiently handling imbalanced learning problems. MWMOTE first identifies the hard-to-learn informative minority class samples and assigns them weights according to their euclidean distance from the nearest majority class samples. It then generates the synthetic samples from the weighted informative minority class samples using a clustering approach. This is done in such a way that all the generated samples lie inside some minority class cluster. MWMOTE has been evaluated extensively on four artificial and 20 real-world data sets. The simulation results show that our method is better than or comparable with some other existing methods in terms of various assessment metrics, such as geometric mean (G-mean) and area under the receiver operating curve (ROC), usually known as area under curve (AUC).

[1]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[2]  N. Japkowicz Learning from Imbalanced Data Sets: A Comparison of Various Strategies * , 2000 .

[3]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[4]  Gary M. Weiss Mining with rarity: a unifying framework , 2004, SKDD.

[5]  Robert C. Holte,et al.  Concept Learning and the Problem of Small Disjuncts , 1989, IJCAI.

[6]  Taeho Jo,et al.  Class imbalances versus small disjuncts , 2004, SKDD.

[7]  Stan Matwin,et al.  Machine Learning for the Detection of Oil Spills in Satellite Radar Images , 1998, Machine Learning.

[8]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.

[9]  Nathalie Japkowicz,et al.  A Novelty Detection Approach to Classification , 1995, IJCAI.

[10]  Hinrich Schütze,et al.  Projections for efficient document clustering , 1997, SIGIR '97.

[11]  Foster Provost,et al.  The effect of class distribution on classifier learning: an empirical study , 2001 .

[12]  Taeho Jo,et al.  A Multiple Resampling Method for Learning from Imbalanced Data Sets , 2004, Comput. Intell..

[13]  Zhi-Hua Zhou,et al.  Exploratory Under-Sampling for Class-Imbalance Learning , 2006, ICDM.

[14]  Gregory W. Corder,et al.  Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach , 2009 .

[15]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.

[16]  Robert C. Holte,et al.  Severe Class Imbalance: Why Better Algorithms Aren't the Answer , 2005, ECML.

[17]  Alberto Maria Segre,et al.  Programs for Machine Learning , 1994 .

[18]  Charles X. Ling,et al.  Data Mining for Direct Marketing: Problems and Solutions , 1998, KDD.

[19]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[20]  Jorma Laurikkala,et al.  Improving Identification of Difficult Small Classes by Balancing Class Distribution , 2001, AIME.

[21]  JapkowiczNathalie,et al.  The class imbalance problem: A systematic study , 2002 .

[22]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..

[23]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[24]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[25]  S. Clearwater,et al.  A rule-learning program in high energy physics event classification , 1991 .

[26]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[27]  Hui Xiong,et al.  Local decomposition for rare class analysis , 2007, KDD '07.

[28]  I. Tomek,et al.  Two Modifications of CNN , 1976 .

[29]  David Mease,et al.  Boosted Classification Trees and Class Probability/Quantile Estimation , 2007, J. Mach. Learn. Res..

[30]  David A. Cieslak,et al.  Start Globally, Optimize Locally, Predict Globally: Improving Performance on Imbalanced Data , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[31]  C. R. Rao,et al.  An Alternative to Correspondence Analysis Using Hellinger Distance. , 1997 .

[32]  Herna L. Viktor,et al.  Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach , 2004, SKDD.

[33]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[34]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[35]  Tom Fawcett,et al.  ROC Graphs: Notes and Practical Considerations for Data Mining Researchers , 2003 .

[36]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[37]  David A. Cieslak,et al.  Automatically countering imbalance and its empirical relationship to cost , 2008, Data Mining and Knowledge Discovery.

[38]  David D. Lewis,et al.  Heterogeneous Uncertainty Sampling for Supervised Learning , 1994, ICML.

[39]  Foster Provost,et al.  Machine Learning from Imbalanced Data Sets 101 , 2008 .

[40]  Ellen M. Voorhees,et al.  Implementing agglomerative hierarchic clustering algorithms for use in document retrieval , 1986, Inf. Process. Manag..

[41]  Gustavo E. A. P. A. Batista,et al.  Class Imbalances versus Class Overlapping: An Analysis of a Learning System Behavior , 2004, MICAI.

[42]  Tom Fawcett,et al.  Adaptive Fraud Detection , 1997, Data Mining and Knowledge Discovery.

[43]  Xin Yao,et al.  Multiclass Imbalance Problems: Analysis and Potential Solutions , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[44]  Haibo He,et al.  RAMOBoost: Ranked Minority Oversampling in Boosting , 2010, IEEE Transactions on Neural Networks.

[45]  D. Opitz,et al.  Popular Ensemble Methods: An Empirical Study , 1999, J. Artif. Intell. Res..

[46]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..