Evaluating Fuzz Testing

Fuzz testing has enjoyed great success at discovering security-critical bugs in real software. Recently, researchers have devoted significant effort to devising new fuzzing techniques, strategies, and algorithms. Such new ideas are primarily evaluated experimentally, so an important question is: what experimental setup is needed to produce trustworthy results? We surveyed the recent research literature and assessed the experimental evaluations carried out by 32 fuzzing papers. We found problems in every evaluation we considered. We then performed our own extensive experimental evaluation using an existing fuzzer. Our results showed that the general problems we found in existing experimental evaluations can indeed translate into wrong or misleading assessments. We conclude with some guidelines that we hope will help improve experimental evaluations of fuzz testing algorithms, making reported results more robust.
