Using Class Imbalance Learning for Software Defect Prediction

To facilitate software testing and reduce testing costs, a wide range of machine learning methods have been studied for predicting defects in software modules. Unfortunately, the imbalanced nature of defect data, in which defective modules are much rarer than non-defective ones, makes this learning task harder. Class imbalance learning specializes in classification problems with imbalanced class distributions. It could therefore help defect prediction, but it has not yet been investigated in depth for this purpose. In this paper, we study whether and how class imbalance learning methods can benefit software defect prediction, with the aim of finding better solutions. We investigate different types of class imbalance learning methods, including resampling techniques, threshold moving, and ensemble algorithms. Among the methods studied, AdaBoost.NC shows the best overall performance in terms of balance, G-mean, and Area Under the Curve (AUC). To further improve its performance and facilitate its use in software defect prediction, we propose a dynamic version of AdaBoost.NC that adjusts its parameter automatically during training. Without requiring any parameters to be predefined, it is shown to be more effective and efficient than the original AdaBoost.NC.
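
As a rough illustration of two of the measures named in the abstract, the sketch below computes G-mean and balance from binary predictions. It assumes the definitions commonly used in the defect prediction literature: G-mean as the geometric mean of the detection rate (pd) and 1 minus the false alarm rate (pf), and balance as 1 minus the normalized Euclidean distance from the ideal point (pf = 0, pd = 1). The function names and the toy labels are hypothetical and are not taken from the paper.

```python
import math

def detection_rates(y_true, y_pred):
    """Return (pd, pf) for binary labels where 1 marks a defective module.

    pd = probability of detection (recall on the defect class)
    pf = probability of false alarm (false positive rate)
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    return pd, pf

def g_mean(pd, pf):
    """Geometric mean of the accuracies on the two classes."""
    return math.sqrt(pd * (1.0 - pf))

def balance(pd, pf):
    """1 minus the normalized distance from the ideal point (pf=0, pd=1)."""
    return 1.0 - math.sqrt((0.0 - pf) ** 2 + (1.0 - pd) ** 2) / math.sqrt(2.0)

# Toy example (made-up labels): 2 defective and 4 non-defective modules.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
pd_rate, pf_rate = detection_rates(y_true, y_pred)
print(f"pd={pd_rate:.2f} pf={pf_rate:.2f}")
print(f"G-mean={g_mean(pd_rate, pf_rate):.3f}")
print(f"balance={balance(pd_rate, pf_rate):.3f}")
```

Both measures reward a classifier only when it does well on the defect class and the non-defect class together, which is why they are preferred over plain accuracy on imbalanced data.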
