Can cross-company data improve performance in software effort estimation?

Background: There has been a long debate in the software engineering literature concerning how useful cross-company (CC) data are for software effort estimation (SEE) in comparison to within-company (WC) data. Studies indicate that models trained on CC data obtain either similar or worse performance than models trained solely on WC data. Aims: We aim at investigating if CC data could help to increase performance and under what conditions. Method: The work concentrates on the fact that SEE is a class of online learning tasks which operate in changing environments, even though most work so far has neglected that. We conduct an analysis based on the performance of different approaches considering CC and WC data. These are: (1) an approach not designed for changing environments, (2) approaches designed for changing environments and (3) a new online learning approach able to identify when CC data are helpful or detrimental. Results: Interesting features of data sets commonly used in the SEE literature are revealed, showing that different subsets of CC data can be beneficial or detrimental depending on the moment in time. The newly proposed approach is able to benefit from that, successfully using CC data to improve performance over WC models. Conclusions: This work not only shows that CC data can help to increase performance for SEE tasks, but also demonstrates that the online nature of software prediction tasks should be exploited, being an important issue to be considered in the future.

[1]  Emilia Mendes,et al.  Investigating the use of chronological split for software effort estimation , 2009, IET Softw..

[2]  Qinbao Song,et al.  Dealing with missing software project data , 2003, Proceedings. 5th International Workshop on Enterprise Networking and Computing in Healthcare Industry (IEEE Cat. No.03EX717).

[3]  Emilia Mendes,et al.  Applying moving windows to software effort estimation , 2009, 2009 3rd International Symposium on Empirical Software Engineering and Measurement.

[4]  Barry W. Boehm,et al.  Software Engineering Economics , 1993, IEEE Transactions on Software Engineering.

[5]  Stephen G. MacDonell,et al.  Evaluating prediction systems in software project estimation , 2012, Inf. Softw. Technol..

[6]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[7]  Emilia Mendes,et al.  Using Chronological Splitting to Compare Cross- and Single-company Effort Models: Further Investigation , 2009, ACSC.

[8]  H. E. Dunsmore,et al.  Software engineering metrics and models , 1986 .

[9]  S. Muthukrishnan,et al.  Data streams: algorithms and applications , 2005, SODA '03.

[10]  Guilherme Horta Travassos,et al.  Cross versus Within-Company Cost Estimation Studies: A Systematic Review , 2007, IEEE Transactions on Software Engineering.

[11]  Tim Menzies,et al.  When to use data from other projects for effort estimation , 2010, ASE.

[12]  A. Bifet,et al.  Early Drift Detection Method , 2005 .

[13]  Emilia Mendes,et al.  Investigating the use of chronological splitting to compare software cross-company and single-company effort predictions , 2008 .

[14]  Xin Yao,et al.  A principled evaluation of ensembles of learning machines for software effort estimation , 2011, Promise '11.

[15]  Mark L. Mitchell,et al.  Research Design Explained , 1987 .

[16]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[17]  Xin Yao,et al.  Using unreliable data for creating more reliable online learners , 2012, The 2012 International Joint Conference on Neural Networks (IJCNN).

[18]  Xin Yao,et al.  The Impact of Diversity on Online Ensemble Learning in the Presence of Concept Drift , 2010, IEEE Transactions on Knowledge and Data Engineering.

[19]  Marcus A. Maloof,et al.  Using additive expert ensembles to cope with concept drift , 2005, ICML.

[20]  Xin Yao,et al.  DDD: A New Ensemble Approach for Dealing with Concept Drift , 2012, IEEE Transactions on Knowledge and Data Engineering.

[21]  Marcus A. Maloof,et al.  Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts , 2007, J. Mach. Learn. Res..

[22]  Emilia Mendes,et al.  Investigating the Use of Chronological Splitting to Compare Software Cross-company and Single-company Effort Predictions: A Replicated Study , 2009, EASE.