A Novel Support Vector Machine Approach to High Entropy Data Fragment Classification

A major challenge in digital forensics is classifying the file type of a fragment of evidence data efficiently and accurately when header and file system information is absent. A typical approach classifies the fragment using simple statistics, such as its entropy and the statistical distance between byte histograms. This approach is ineffective for high entropy data, such as multimedia and compressed files, which often appear random. We propose a method based on a support vector machine (SVM). Specifically, we extract feature vectors from the byte frequencies of a given fragment and train an SVM, under supervised learning, to predict the fragment's type. Our method is efficient and achieves high accuracy for high entropy data fragments.
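
The feature extraction step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact features, normalization, or SVM configuration, so everything here is an assumption: the function names `byte_frequencies` and `shannon_entropy` are hypothetical, the feature vector is taken to be the 256 relative byte frequencies, and the entropy helper is included only to show why entropy alone separates high entropy fragments poorly.

```python
import math
from collections import Counter


def byte_frequencies(fragment: bytes) -> list[float]:
    """Return a 256-dimensional vector of relative byte frequencies.

    Assumption: plain unigram frequencies normalized by fragment length;
    the paper may use a different or richer feature set.
    """
    if not fragment:
        raise ValueError("fragment must not be empty")
    counts = Counter(fragment)  # iterating bytes yields ints in 0..255
    n = len(fragment)
    return [counts.get(value, 0) / n for value in range(256)]


def shannon_entropy(freqs: list[float]) -> float:
    """Entropy in bits per byte (between 0.0 and 8.0) of a frequency vector."""
    return sum(-p * math.log2(p) for p in freqs if p > 0)


if __name__ == "__main__":
    # A synthetic 4096-byte fragment with a perfectly uniform byte distribution,
    # standing in for compressed or multimedia data.
    fragment = bytes(range(256)) * 16
    features = byte_frequencies(fragment)
    print(len(features), shannon_entropy(features))
```

In a full pipeline, vectors like `features` would be computed for labeled training fragments and passed, together with their file type labels, to an SVM trainer; that step is omitted here because it depends on the chosen SVM library and kernel.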
