A practical guide for using statistical tests to assess randomized algorithms in software engineering

Randomized algorithms have been used to successfully address many different types of software engineering problems. This type of algorithm employs a degree of randomness as part of its logic. Randomized algorithms are useful for difficult problems where a precise solution cannot be derived deterministically within reasonable time. However, randomized algorithms produce different results on every run when applied to the same problem instance. It is hence important to assess the effectiveness of randomized algorithms by collecting data from a sufficiently large number of runs. The use of rigorous statistical tests is then essential to support the conclusions derived from analyzing such data. In this paper, we provide a systematic review of the use of randomized algorithms in selected software engineering venues in 2009. The goal of this review is not to perform a complete survey but to obtain a representative snapshot of current practice in software engineering research. We show that randomized algorithms are used in a significant percentage of papers but that, in most cases, randomness is not properly accounted for. This casts doubt on the validity of most empirical results assessing randomized algorithms. There are numerous statistical tests, based on different assumptions, and it is not always clear when and how to use them. We hence provide practical guidelines to support empirical research on randomized algorithms in software engineering.
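As an illustration of what comparing two randomized algorithms over repeated runs can look like, the following Python sketch computes a nonparametric effect size (the Vargha-Delaney A12 statistic) and a two-sided Mann-Whitney U p-value. It is a minimal sketch under stated assumptions, not the procedure prescribed by the paper: `run_algorithm` is a hypothetical stand-in for a real randomized algorithm, the choice of 30 runs and the success rates are arbitrary, and the p-value uses a plain normal approximation without tie or continuity correction.

```python
import random
from statistics import NormalDist


def a12(xs, ys):
    """Vargha-Delaney A12: estimated probability that a value drawn from xs
    is larger than a value drawn from ys, with ties counted as one half."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


def mann_whitney_p(xs, ys):
    """Two-sided Mann-Whitney U p-value using the normal approximation.
    Assumption: no tie correction and no continuity correction, so this is
    only a rough approximation for small or heavily tied samples."""
    m, n = len(xs), len(ys)
    u = a12(xs, ys) * m * n              # U statistic for xs
    mean_u = m * n / 2                   # expected U under the null hypothesis
    sd_u = (m * n * (m + n + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


def run_algorithm(success_rate, rng):
    """Hypothetical stand-in for one run of a randomized algorithm:
    returns how many of 100 targets were covered in that run."""
    return sum(rng.random() < success_rate for _ in range(100))


# Collect results from repeated independent runs of each algorithm.
rng = random.Random(42)                  # fixed seed so the sketch is repeatable
runs_a = [run_algorithm(0.60, rng) for _ in range(30)]
runs_b = [run_algorithm(0.50, rng) for _ in range(30)]

print("A12 effect size:", a12(runs_a, runs_b))
print("approximate p-value:", mann_whitney_p(runs_a, runs_b))
```

An A12 value of 0.5 means neither algorithm tends to produce larger results; values closer to 1 favor the first sample. Reporting an effect size alongside the p-value indicates how large a difference is, not only whether one is detectable.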
