Detecting a malicious executable without prior knowledge of its patterns

Malicious executables, often spread as email attachments, are usually detected with one of two types of algorithms under instance-based statistical learning paradigms: (1) signature-based template matching, which extracts unique tell-tale characteristics of a malicious executable and matches new files against those known signatures; and (2) two-class supervised learning, which determines a set of features that allow benign and malicious patterns to occupy disjoint regions in a feature vector space and thus probabilistically identifies malicious executables with similar features. Nevertheless, given the huge potential variety of malicious executables, we cannot be confident that existing training sets adequately represent the class as a whole. In this study, we investigated the use of byte sequence frequencies to profile only benign data. Malicious executables are then identified as outliers or anomalies that deviate significantly from the normal profile. A multivariate Gaussian likelihood model, fit with Principal Component Analysis (PCA), was compared with a one-class Support Vector Machine (SVM) model for characterizing the benign executables. We found that the Gaussian model substantially outperformed the one-class SVM in its ability to distinguish malicious from benign files. Complementing the two aforementioned methods, which reliably detect malicious files with known or similar features, the one-class unsupervised approach may provide another layer of safeguard by identifying novel computer viruses.
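The one-class idea described above can be illustrated with a minimal sketch. It is not the study's implementation: it assumes single-byte histograms as a stand-in for byte sequence frequencies, synthetic "files" drawn from random byte distributions, 8 principal components, and a probabilistic-PCA-style treatment of the discarded directions (one shared residual variance). The names `byte_histogram`, `BenignProfile`, and `fake_file` are hypothetical.

```python
import numpy as np


def byte_histogram(data: bytes) -> np.ndarray:
    """Relative frequency of each byte value (a 1-gram simplification)."""
    counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    return counts / max(len(data), 1)


class BenignProfile:
    """Gaussian profile of benign feature vectors, parameterized via PCA.

    Assumption: the top components keep their own variances, and all
    discarded directions share one averaged residual variance.
    """

    def __init__(self, n_components: int = 8, eps: float = 1e-12):
        self.n_components = n_components
        self.eps = eps

    def fit(self, X: np.ndarray) -> "BenignProfile":
        self.mean_ = X.mean(axis=0)
        Xc = X - self.mean_
        # Rows of vt are principal directions; s**2 / (n - 1) are variances.
        _, s, vt = np.linalg.svd(Xc, full_matrices=False)
        var = s ** 2 / max(len(X) - 1, 1)
        k = min(self.n_components, len(s))
        self.components_ = vt[:k]
        self.var_ = var[:k] + self.eps
        self.resid_var_ = (var[k:].mean() if k < len(var) else 0.0) + self.eps
        return self

    def score(self, X: np.ndarray) -> np.ndarray:
        """Squared Mahalanobis-type distance from the benign profile.

        Under the Gaussian assumption this equals the negative
        log-likelihood up to a scale factor and an additive constant,
        so larger values mean "less like the benign data".
        """
        Xc = np.atleast_2d(X) - self.mean_
        Z = Xc @ self.components_.T
        resid = Xc - Z @ self.components_
        return (Z ** 2 / self.var_).sum(axis=1) + (resid ** 2).sum(axis=1) / self.resid_var_


# Synthetic demonstration (no real executables involved).
rng = np.random.default_rng(0)
benign_probs = rng.dirichlet(np.ones(256))  # skewed byte distribution


def fake_file(probs: np.ndarray, size: int = 2048) -> bytes:
    return rng.choice(256, size=size, p=probs).astype(np.uint8).tobytes()


benign = np.array([byte_histogram(fake_file(benign_probs)) for _ in range(400)])
profile = BenignProfile(n_components=8).fit(benign)

# Flag anything scoring above the 99th percentile of benign training scores.
threshold = np.percentile(profile.score(benign), 99)

# A file with uniformly distributed bytes plays the role of the outlier.
odd = byte_histogram(fake_file(np.full(256, 1 / 256)))
odd_score = profile.score(odd)[0]
print(f"threshold={threshold:.1f}  outlier score={odd_score:.1f}  flagged={odd_score > threshold}")
```

Only benign samples are used for fitting; the threshold percentile is a tunable trade-off between false alarms and missed detections. A one-class SVM could be fit to the same `benign` matrix for comparison, which is the contrast the study reports on.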
