A Feature Selection Methodology for Steganalysis
Yoan Miche, Patrick Bas, Amaury Lendasse, Christian Jutten and Olli Simula
TRAITEMENT DU SIGNAL
Steganography has been known and used for a very long time as a way to exchange information in an unnoticeable manner between parties, by embedding it in another, apparently innocuous, document. Nowadays, steganographic techniques are mostly applied to digital content. The online newspaper Wired News reported in one of its articles on steganography that steganographic content had been found on web sites hosting very large image databases, such as eBay. Niels Provos somewhat refuted these claims by analyzing and classifying two million images from eBay and one million from the USENet network without finding any embedded steganographic content. This could be due to many reasons, such as very low payloads, which make the steganographic images very robust and secure against steganalysis.
The security of a steganographic scheme has been defined theoretically by Cachin in , but this definition is seldom usable in practice: it requires evaluating the underlying distributions and measuring the Kullback-Leibler divergence between them.
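For reference, Cachin's information-theoretic definition can be stated as follows: denoting by $P_C$ the distribution of cover media and by $P_S$ the distribution of stego media, a steganographic scheme is said to be $\varepsilon$-secure if

```latex
D_{\mathrm{KL}}(P_C \,\|\, P_S)
  = \sum_{x} P_C(x) \log \frac{P_C(x)}{P_S(x)}
  \leq \varepsilon ,
```

and perfectly secure when $\varepsilon = 0$. Estimating $P_C$ and $P_S$ for real media is what makes this definition so hard to use in practice.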
In practice, steganalysis is used to evaluate the security of a steganographic scheme empirically: it aims at detecting whether a medium has been tampered with, but not at determining what is hidden in the medium or how it has been embedded. By extracting features, one can capture relevant characteristics of the considered medium and then assess, usually with machine learning tools, whether the medium is genuine or not. This is only one way to perform steganalysis, but it remains the most common.
One of the main issues with this scheme is that practitioners tend to use more and more features extracted from the media (we consider only JPEG images in this article) in order to increase the detection performance on modified images. The number of features corresponds to the dimensionality of the space in which the machine learning processes (typically, the training of a classifier) are performed. This usually leads to very high-dimensional spaces in which many problems arise (in comparison to low-dimensional spaces): mainly, the number of images required to appropriately fill the space in which the classifier is trained is never reached. This filling is required for the classifier to be trained on data properly distributed across the feature space. Also, when the number of features is too high, interpreting which features are the most relevant becomes very difficult, if not impossible.
In this article, some of the problems caused by the high dimensionality usually met in steganalysis are presented, along with possible solutions.
For the problem of the number of images required to fill the space, an estimate of a sufficient number of images is proposed: a bootstrap algorithm is used to estimate the variance of the classifier's results for different numbers of images. Once the variance is low enough for the results to be accurate, the number of images required for that number of features has been reached.
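The bootstrap step can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: the classifier (a simple nearest-centroid rule), the train/test split ratio, and all function names are assumptions made here for the sake of a self-contained example.

```python
import numpy as np

def nearest_centroid_accuracy(Xtr, ytr, Xte, yte):
    """Test accuracy of a nearest-centroid classifier for two classes
    (a stand-in for whatever classifier the steganalyzer uses)."""
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1)
            < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float((pred == yte).mean())

def bootstrap_variance(X, y, n_images, n_boot=30, seed=0):
    """Variance of the classification accuracy when only `n_images`
    images are available, estimated over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_boot):
        # Draw a bootstrap sample of n_images (with replacement),
        # then split it into training and test parts.
        idx = rng.choice(len(X), size=n_images, replace=True)
        split = int(0.7 * n_images)
        tr, te = idx[:split], idx[split:]
        accs.append(nearest_centroid_accuracy(X[tr], y[tr], X[te], y[te]))
    return float(np.var(accs))

# One would increase n_images until bootstrap_variance(...) drops
# below a chosen tolerance; that n_images is deemed sufficient.
```

The criterion "variance low enough" then amounts to comparing the returned value against a tolerance chosen by the practitioner.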
With this sufficient number of images, feature selection is then performed with a forward algorithm, in an attempt to decrease the dimensionality and also to gain interpretability regarding which features have reacted the most. Hence, knowledge about the steganographic scheme's weaknesses can be inferred, and the scheme could be modified accordingly to improve its security.
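A generic greedy forward selection loop looks like the following sketch. The selection criterion used here (holdout accuracy of a nearest-centroid classifier) is an assumption for illustration; the paper's actual classifier and criterion may differ.

```python
import numpy as np

def holdout_score(X, y, seed=0):
    """Holdout accuracy of a nearest-centroid classifier on the given
    feature subset (illustrative selection criterion)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    split = int(0.7 * len(X))
    tr, te = idx[:split], idx[split:]
    c0 = X[tr][y[tr] == 0].mean(axis=0)
    c1 = X[tr][y[tr] == 1].mean(axis=0)
    pred = (np.linalg.norm(X[te] - c1, axis=1)
            < np.linalg.norm(X[te] - c0, axis=1)).astype(int)
    return float((pred == y[te]).mean())

def forward_selection(X, y, k, score_fn=holdout_score):
    """Greedy forward feature selection: at each step, add the single
    remaining feature that most improves the score, until k features
    have been selected."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        best_f, best_s = None, -np.inf
        for f in remaining:
            s = score_fn(X[:, selected + [f]], y)
            if s > best_s:
                best_f, best_s = f, s
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```

The order in which features enter `selected` is precisely what gives the interpretability mentioned above: early-selected features are those the classifier reacts to most.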
These ideas are combined in a methodology, which is tested on six different steganographic algorithms, for different sizes of the embedded information. The result is an estimation of the number of images sufficient to obtain results with low enough variance. The selected feature sets also make it possible to keep the same performance (within the small variance range) while providing insights into the weaknesses of each algorithm; these weaknesses are analyzed separately for each algorithm.
In conclusion, the proposed methodology makes it possible to estimate the variance of the results typically reported in steganalysis, while adding interpretability. The proposed reduced feature sets have also made it possible to maintain the same performance as with the full set.