CTDroid: Leveraging a Corpus of Technical Blogs for Android Malware Analysis

The rapid growth of Android malware has prompted a large body of malware analysis approaches that leverage machine learning algorithms. However, the effectiveness of these approaches depends primarily on manual feature engineering, a time-consuming and labor-intensive process that relies on expert knowledge and intuition. In this paper, we propose an automatic approach that engineers informative features from a corpus of technical blogs on Android malware, which are written in a way that mirrors the human feature engineering process. This approach faces two main challenges. First, it is difficult to recognize useful knowledge in the massive amount of information contained in thousands of blogs. To this end, we leverage natural language processing techniques to process the blogs and extract a set of sensitive behaviors that might harm users. Second, there is a semantic gap between the extracted sensitive behaviors and the programming language. To bridge it, we propose two semantic matching rules that match the behaviors with concrete code snippets so that apps can be tested experimentally. We design and implement a system called CTDroid for malware analysis, covering malware detection (MD) and familial classification (FC). Evaluated on a large set of real malware and benign apps, CTDroid achieves a 95.8% true positive rate with only a 1% false positive rate for MD and 97.9% accuracy for FC. Furthermore, our proposed features are more informative than those of state-of-the-art approaches.
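To make the idea of matching blog-derived behaviors to code more concrete, the following is a minimal Python sketch, not CTDroid's actual implementation. The abstract does not specify the two semantic matching rules, so the single rule here (every word of a behavior phrase must appear among the tokens of an API identifier) is an assumption chosen for illustration. The behavior list, the API list, and all function names are hypothetical.

```python
import re

# Hypothetical (verb, object) behaviors, as might be extracted from blog text.
BEHAVIORS = [("send", "sms"), ("read", "contacts"), ("get", "device id")]

# Hypothetical API identifiers found in one app.
APP_APIS = [
    "android.telephony.SmsManager.sendTextMessage",
    "android.telephony.TelephonyManager.getDeviceId",
    "android.app.Activity.onCreate",
]


def tokenize_identifier(identifier):
    """Split a dotted, camelCase identifier into a set of lowercase words."""
    tokens = set()
    for part in identifier.split("."):
        for word in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", part):
            tokens.add(word.lower())
    return tokens


def behavior_matches_api(behavior, api):
    """Assumed rule: all words of the behavior occur in the API's tokens."""
    verb, obj = behavior
    words = {verb, *obj.split()}
    return words <= tokenize_identifier(api)


def feature_vector(behaviors, apis):
    """One binary feature per behavior: 1 if any API in the app matches it."""
    return [
        int(any(behavior_matches_api(b, api) for api in apis))
        for b in behaviors
    ]


if __name__ == "__main__":
    print(feature_vector(BEHAVIORS, APP_APIS))
```

A vector of this kind, with one entry per sensitive behavior, could then be passed to any standard classifier for MD or FC. A realistic matcher would also need to handle synonyms (for example "text message" versus "sms"), which this word-overlap rule does not attempt.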
