Learning Representations for Log Data in Cybersecurity

We introduce a framework for exploring and learning representations of log data generated by enterprise-grade security devices, with the goal of detecting advanced persistent threats (APTs) that span several weeks. The framework uses a divide-and-conquer strategy that combines behavioral analytics, time series modeling, and representation learning algorithms to model large volumes of data. In addition, because human-engineered features are available, we analyze how well a series of representation learning algorithms complement those features across a variety of classification approaches. We demonstrate the approach on a novel dataset extracted from 3 billion log lines generated at the boundaries of an enterprise network with reported command and control communications. The results validate our approach, achieving an area under the ROC curve of 0.943 and 95 true positives among the top 100 ranked instances on the test dataset.
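
The two reported metrics, area under the ROC curve and the number of true positives among the top-ranked instances, can be illustrated with a minimal sketch. The function names `roc_auc` and `true_positives_at_k` and the toy labels and scores below are assumptions made for illustration only; they are not the evaluation code or the data behind the reported results.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels are 0/1 (1 = malicious); higher scores mean "more suspicious".
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Group tied scores so that they share the same average rank.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def true_positives_at_k(labels: Sequence[int], scores: Sequence[float], k: int) -> int:
    """Count positives among the k highest-scored instances."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    toy_labels = [1, 0, 1, 0, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
    print(roc_auc(toy_labels, toy_scores))
    print(true_positives_at_k(toy_labels, toy_scores, k=2))
```

The top-k count reflects the analyst-facing use case: only the highest-ranked instances are likely to be investigated, so the number of true attacks among them matters alongside the overall AUC.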
