A Feature Selection Methodology for Steganalysis (Une méthodologie pour la Sélection de Variables pour la Stéganalyse)

Traitement du Signal, 2009, volume 26, number 1, page 13

Steganography has long been used to exchange information unnoticed between parties by embedding it in another, apparently innocuous, document. Nowadays, steganographic techniques are mostly applied to digital content. The online newspaper Wired News reported in an article on steganography [2] that several instances of steganographic content had been found on web sites with very large image databases, such as eBay. Niels Provos [3] has somewhat refuted these claims: he analyzed and classified two million images from eBay and one million from the USENET network and found no steganographic content embedded in them. There could be many reasons for this, such as very low payloads, which make steganographic images very robust against steganalysis.

The security of a steganographic scheme was defined theoretically by Cachin [1], but this definition is seldom usable in practice: it requires estimating distributions and measuring the Kullback-Leibler divergence between them. In practice, steganalysis is used to evaluate the security of a steganographic scheme empirically. It aims at detecting whether a medium has been tampered with, not at determining what the medium contains or how it was embedded. Features capture relevant characteristics of the medium under consideration, and machine learning tools are then typically used to assess whether the medium is genuine. This is only one way to perform steganalysis, but it remains the most common.

One of the main issues with this approach is that more and more features are extracted from the media (only JPEG images are considered in this article) in order to improve the detection of modified images.
This number of features is the dimensionality of the space in which machine learning processes (typically, the training of a classifier) are performed. The resulting spaces are usually very high dimensional, which raises problems that low-dimensional spaces do not have. Chiefly, the number of images required to fill the space in which the classifier is trained is never reached, although this filling is needed for the classifier to train on data that are properly distributed over the feature space. Also, when there are too many features, interpreting the most relevant ones becomes very difficult, if not impossible.

This article presents some of the problems caused by the high dimensionality usually met in steganalysis, along with possible solutions. For the number of images required to fill the space, an evaluation of a sufficient number of images is proposed: a bootstrap algorithm estimates the variance of the classifier's results for different numbers of images. Once the variance is low enough for the results to be accurate, the number of images sufficient for that number of features has been reached. With this sufficient number of images, feature selection is then performed with a forward algorithm, in an attempt to reduce the dimensionality and to gain interpretability about which features react the most. Knowledge of the steganographic scheme's weaknesses can thus be inferred, and the scheme could be modified accordingly to improve its security.

These ideas are combined in a methodology that is tested on 6 different steganographic algorithms, for different sizes of the embedded information. The result is an estimate of the number of images sufficient to obtain results with low enough variance.
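
The bootstrap step described above can be illustrated with a minimal sketch. It is only one plausible reading of the procedure, not the authors' implementation: the nearest-centroid classifier, the synthetic Gaussian "features", and all function names (`nearest_centroid_accuracy`, `bootstrap_accuracy_variance`, `make_toy_data`) are assumptions introduced for illustration. For each candidate number of images, the training pool is resampled with replacement, the classifier is retrained, and the variance of its test accuracy over the resamples is recorded.

```python
import random
import statistics


def nearest_centroid_accuracy(train, test):
    """Fit a nearest-centroid classifier on `train`, return accuracy on `test`.

    Each sample is a (features, label) pair; label 0 = cover, 1 = stego.
    This classifier is a stand-in, not the one used in the article.
    """
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        if rows:  # a small resample may miss one class entirely
            centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return sum(predict(x) == y for x, y in test) / len(test)


def bootstrap_accuracy_variance(pool, test, n_images, n_rounds=50, seed=0):
    """Variance of test accuracy over bootstrap resamples of size `n_images`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        sample = rng.choices(pool, k=n_images)  # draw with replacement
        scores.append(nearest_centroid_accuracy(sample, test))
    return statistics.pvariance(scores)


def make_toy_data(n, n_features=5, shift=0.5, seed=0):
    """Synthetic stand-in for image features: class 1 is shifted by `shift`."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(label * shift, 1.0) for _ in range(n_features)]
        data.append((x, label))
    return data


if __name__ == "__main__":
    pool = make_toy_data(2000, seed=1)
    test = make_toy_data(500, seed=2)
    for n in (20, 100, 500, 2000):
        var = bootstrap_accuracy_variance(pool, test, n)
        print(f"{n:5d} images: accuracy variance = {var:.6f}")
```

In such a sketch, the sufficient number of images would be the smallest `n_images` for which the reported variance falls below a tolerance chosen by the experimenter; that threshold is not specified here.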
The selected feature sets also maintain the same performance (within the small variance range) while providing insights into the weaknesses of each algorithm. These weaknesses are analyzed separately for each algorithm. In conclusion, the proposed methodology makes it possible to estimate the variance of the results typically reported in steganalysis and adds interpretability. The proposed reduced feature sets also retain the performance obtained with the full set.
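
As a rough illustration of the forward algorithm mentioned in the abstract, the following sketch performs greedy forward selection on synthetic data. The abstract does not specify the selection criterion or the stopping rule, so both are assumptions here: the score is the validation accuracy of a nearest-centroid classifier, and the search stops when no remaining feature improves it. The names `centroid_accuracy`, `forward_selection` and `make_toy_data` are hypothetical.

```python
import random


def centroid_accuracy(train, valid, features):
    """Validation accuracy of a nearest-centroid classifier using only `features`."""
    centroids = {}
    for label in (0, 1):
        rows = [[x[j] for j in features] for x, y in train if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    hits = 0
    for x, y in valid:
        point = [x[j] for j in features]
        guess = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        )
        hits += guess == y
    return hits / len(valid)


def forward_selection(train, valid, n_features):
    """Greedily add the feature that most improves the score; stop when none does."""
    selected, scores = [], []
    remaining = list(range(n_features))
    best = 0.0
    while remaining:
        # Score every one-feature extension of the current subset.
        trials = [
            (centroid_accuracy(train, valid, selected + [j]), j) for j in remaining
        ]
        # Highest score wins; ties go to the lowest feature index.
        score, j = max(trials, key=lambda t: (t[0], -t[1]))
        if score <= best:
            break
        selected.append(j)
        remaining.remove(j)
        scores.append(score)
        best = score
    return selected, scores


def make_toy_data(n, n_features=6, informative=(1, 4), shift=2.0, seed=0):
    """Synthetic data: only the `informative` features differ between classes."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        x = [
            rng.gauss(shift * label if j in informative else 0.0, 1.0)
            for j in range(n_features)
        ]
        data.append((x, label))
    return data


if __name__ == "__main__":
    train = make_toy_data(1000, seed=1)
    valid = make_toy_data(1000, seed=2)
    selected, scores = forward_selection(train, valid, n_features=6)
    print("selection order:", selected)
    print("validation accuracy after each step:", scores)
```

The order in which features enter the subset is what gives the interpretability discussed above: features selected first are those that react most to the embedding in this toy setting.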

[1] Jessica J. Fridrich, et al. Feature-Based Steganalysis for JPEG Images and Its Implications for Future Design of Steganographic Schemes, 2004, Information Hiding.

[2] Phil Sallee, et al. Model-Based Steganography, 2003, IWDW.

[3] Petra Mutzel, et al. A Graph-Theoretic Approach to Steganography, 2005, Communications and Multimedia Security.

[4] Michel Verleysen, et al. Mutual information for the selection of relevant variables in spectrometric nonlinear modelling, 2006, ArXiv.

[5] Chee Kheong Siew, et al. Extreme learning machine: Theory and applications, 2006, Neurocomputing.

[6] Amaury Lendasse, et al. Extracting relevant features of steganographic schemes by feature selection techniques, 2007.

[7] Christian Cachin, et al. An information-theoretic model for steganography, 1998, Inf. Comput.

[8] R. Tibshirani, et al. An introduction to the bootstrap, 1993.

[9] Michel Verleysen, et al. The Curse of Dimensionality in Data Mining and Time Series Prediction, 2005, IWANN.

[10] Amaury Lendasse, et al. Long-term prediction of time series using NNE-based projection and OP-ELM, 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[11] Richard Bellman, et al. Adaptive Control Processes: A Guided Tour, 1961, The Mathematical Gazette.

[12] M. Kenward, et al. An Introduction to the Bootstrap, 2007.

[13] Amaury Lendasse, et al. A Methodology for Building Regression Models using Extreme Learning Machine: OP-ELM, 2008, ESANN.

[14] Niels Provos, et al. Defending Against Statistical Steganalysis, 2001, USENIX Security Symposium.

[15] Dana S. Richards, et al. Modified Matrix Encoding Technique for Minimal Distortion Steganography, 2006, Information Hiding.

[17] Yun Q. Shi, et al. A Markov Process Based Approach to Effective Attacking JPEG Steganography, 2006, Information Hiding.

[18] S. T. Buckland, et al. An Introduction to the Bootstrap, 1994.

[19] Siwei Lyu, et al. Detecting Hidden Messages Using Higher-Order Statistics and Support Vector Machines, 2002, Information Hiding.

[20] D. François. High-dimensional data analysis: optimal metrics and feature selection, 2007.

[21] Tomás Pevný, et al. Merging Markov and DCT features for multi-class JPEG steganalysis, 2007, Electronic Imaging.

[22] Amaury Lendasse, et al. A Feature Selection Methodology for Steganalysis, 2006, MRCS.

[23] Andreas Westfeld, et al. F5-A Steganographic Algorithm, 2001, Information Hiding.

[24] Niels Provos, et al. Detecting Steganographic Content on the Internet, 2002, NDSS.

[25] Andreas Pfitzmann, et al. Attacks on Steganographic Systems, 1999, Information Hiding.