A novel automated approach for software effort estimation based on data augmentation

Software effort estimation (SEE) usually suffers from data scarcity because collecting project data is expensive and time-consuming. As a result, companies often have only a limited number of past projects available for effort estimation, which leads to unsatisfactory prediction performance. Few studies have investigated strategies for generating additional SEE data to aid such learning. We propose a synthetic data generator to address the data scarcity problem of SEE. Our generator enlarges the SEE data set by slightly displacing some randomly chosen training examples, and it can be used with any SEE method as a data preprocessor. Its effectiveness is evaluated with six state-of-the-art SEE models across 14 SEE data sets, and we also compare it against the only existing data generation approach in the SEE literature. Experimental results show that our synthetic projects can significantly improve the performance of some SEE methods, especially when the training data are insufficient; when they cannot significantly improve prediction performance, they are not detrimental either. Moreover, our synthetic data generator is significantly superior to, or performs similarly to, its competitor in the SEE literature. Overall, the proposed data generator has a non-harmful, if not significantly beneficial, effect on the SEE methods investigated in this paper, and is therefore helpful in addressing the data scarcity problem of SEE.
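To make the idea concrete, below is a minimal sketch in Python of such a preprocessor. It is not the paper's exact procedure: the abstract only states that randomly chosen training examples are "slightly displaced", so the Gaussian form of the displacement, its scaling by each feature's standard deviation, and the names augment_see_data and displacement_scale are illustrative assumptions.

import numpy as np

def augment_see_data(X, y, n_synthetic, displacement_scale=0.05, seed=None):
    """Enlarge an SEE training set with synthetic projects created by
    slightly displacing randomly chosen training examples (a sketch of
    the idea described in the abstract, not the authors' exact method)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Per-feature spread, used to keep the displacements "slight"
    # relative to the scale of each project attribute and of the effort.
    feature_std = X.std(axis=0)
    effort_std = y.std()

    # Randomly choose training examples to displace (with replacement).
    idx = rng.integers(0, len(X), size=n_synthetic)

    # Add small Gaussian noise proportional to each feature's spread
    # (the Gaussian choice is an assumption for illustration only).
    X_syn = X[idx] + rng.standard_normal((n_synthetic, X.shape[1])) * displacement_scale * feature_std
    y_syn = y[idx] + rng.standard_normal(n_synthetic) * displacement_scale * effort_std

    # Return the enlarged training set: original plus synthetic projects.
    return np.vstack([X, X_syn]), np.concatenate([y, y_syn])

Because the generator only rewrites the training set, it can sit in front of any SEE method, e.g.:

# X_train, y_train: the original (possibly very small) SEE training set.
# X_aug, y_aug = augment_see_data(X_train, y_train, n_synthetic=len(X_train))
# Any SEE regressor (support vector regression, analogy-based estimation,
# etc.) is then trained on (X_aug, y_aug) exactly as on the original data.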
