Cross-Project Transfer Representation Learning for Vulnerable Function Discovery

Machine learning is now widely used to detect security vulnerabilities in the software, even before the software is released. But its potential is often severely compromised at the early stage of a software project when we face a shortage of high-quality training data and have to rely on overly generic hand-crafted features. This paper addresses this cold-start problem of machine learning, by learning rich features that generalize across similar projects. To reach an optimal balance between feature-richness and generalizability, we devise a data-driven method including the following innovative ideas. First, the code semantics are revealed through serialized abstract syntax trees (ASTs), with tokens encoded by Continuous Bag-of-Words neural embeddings. Next, the serialized ASTs are fed to a sequential deep learning classifier (Bi-LSTM) to obtain a representation indicative of software vulnerability. Finally, the neural representation obtained from existing software projects is then transferred to the new project to enable early vulnerability detection even with a small set of training labels. To validate this vulnerability detection approach, we manually labeled 457 vulnerable functions and collected 30 000+ nonvulnerable functions from six open-source projects. The empirical results confirmed that the trained model is capable of generating representations that are indicative of program vulnerability and is adaptable across multiple projects. Compared with the traditional code metrics, our transfer-learned representations are more effective for predicting vulnerable functions, both within a project and across multiple projects.

[1]  Lukás Burget,et al.  Recurrent neural network based language model , 2010, INTERSPEECH.

[2]  Shouhuai Xu,et al.  VulPecker: an automated vulnerability detection system based on code similarity analysis , 2016, ACSAC.

[3]  Konrad Rieck,et al.  Chucky: exposing missing checks in source code for vulnerability discovery , 2013, CCS.

[4]  Felix FX Lindner,et al.  Vulnerability Extrapolation: Assisted Discovery of Vulnerabilities Using Machine Learning , 2011, WOOT.

[5]  Laurie A. Williams,et al.  Secure open source collaboration: an empirical study of linus' law , 2009, CCS.

[6]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[7]  Jun Zhang,et al.  POSTER: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects , 2017, CCS.

[8]  Mohammad Zulkernine,et al.  Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities , 2011, J. Syst. Archit..

[9]  Leon Moonen,et al.  Generating robust parsers using island grammars , 2001, Proceedings Eighth Working Conference on Reverse Engineering.

[10]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[11]  Dawson R. Engler,et al.  Bugs as deviant behavior: a general approach to inferring errors in systems code , 2001, SOSP.

[12]  Dawson R. Engler,et al.  KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs , 2008, OSDI.

[13]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[14]  Heejo Lee,et al.  VUDDY: A Scalable Approach for Vulnerable Code Clone Discovery , 2017, 2017 IEEE Symposium on Security and Privacy (SP).

[15]  Laurie A. Williams,et al.  Evaluating Complexity, Code Churn, and Developer Activity Metrics as Indicators of Software Vulnerabilities , 2011, IEEE Transactions on Software Engineering.

[16]  David Brumley,et al.  ReDeBug: Finding Unpatched Code Clones in Entire OS Distributions , 2012, 2012 IEEE Symposium on Security and Privacy.

[17]  Matthew Smith,et al.  VCCFinder: Finding Potential Vulnerabilities in Open-Source Projects to Assist Code Audits , 2015, CCS.

[18]  James Newsom,et al.  Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software, Network and Distributed System Security Symposium Conference Proceedings : 2005 , 2005 .

[19]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[20]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[21]  Konrad Rieck,et al.  Generalized vulnerability extrapolation using abstract syntax trees , 2012, ACSAC '12.

[22]  Andreas Zeller,et al.  Predicting vulnerable software components , 2007, CCS '07.