X-MIFS: Exact Mutual Information for Feature Selection

In machine learning, an information-theoretically optimal way to filter the best input features, without reference to any specific model, is to maximize the mutual information between the selected features and the model output: this choice minimizes the uncertainty in the output to be predicted, given the feature values. Although this criterion is optimal from the standpoint of information theory, a practical difficulty in using it lies in the need to estimate the mutual information from a limited set of input-output examples, in possibly very high-dimensional input spaces. Estimating probability densities from a limited number of data points under these conditions is far from trivial. Starting from the seminal proposals in [1], different approaches approximate the mutual information by considering a limited set of variable dependencies (such as dependencies among pairs or triplets of variables), or by assuming specific forms for the probability densities (such as Gaussian forms). In this paper we study the effect of using the exact mutual information between the selected features and the output, without resorting to any approximation (apart from that implicit and unavoidable in estimating it from experimental data). The objectives of this investigation are twofold: to assess how far one can go by adopting the exact mutual information, in terms of CPU time and number of features, and to measure what is lost by adopting popular approximations that consider only relationships among small subsets of features, assume specific distributions of feature values (e.g. Gaussian), or maximize upper bounds on the mutual information as proxies for the exact value. The experimental results show a significant performance advantage when the feature sets identified by the exact mutual information are used, in both binary and multi-valued classification tasks, at the cost of longer CPU times.
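To make the criterion concrete, the following is a minimal sketch (not the paper's actual X-MIFS implementation; function names are illustrative) of greedy forward selection for discrete features, where each step adds the feature that maximizes the exact empirical mutual information I(X_S; Y) of the whole selected set with the output, rather than a pairwise or Gaussian approximation:

```python
import numpy as np
from collections import Counter

def empirical_mi(X_cols, y):
    """Exact mutual information I(X_S; Y), in bits, of the empirical
    joint distribution of the selected discrete feature columns and y.

    X_cols: (n_samples, k) array of selected feature columns.
    y: (n_samples,) array of class labels.
    """
    n = len(y)
    rows = list(map(tuple, X_cols.tolist()))      # each sample's feature tuple
    joint = Counter(zip(rows, y.tolist()))        # counts of (x-tuple, label)
    px = Counter(rows)                            # marginal counts of x-tuples
    py = Counter(y.tolist())                      # marginal counts of labels
    mi = 0.0
    for (xv, yv), c in joint.items():
        p_xy = c / n
        # p_xy * log2( p_xy / (p_x * p_y) ), with p_x = px[xv]/n, p_y = py[yv]/n
        mi += p_xy * np.log2(p_xy * n * n / (px[xv] * py[yv]))
    return mi

def greedy_exact_mifs(X, y, k):
    """Forward selection: greedily grow the set S maximizing I(X_S; Y)."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining,
                   key=lambda f: empirical_mi(X[:, selected + [f]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The cost of `empirical_mi` grows with the number of distinct joint configurations of the selected features, which is exactly the curse-of-dimensionality issue the abstract describes: with many features and few samples, the empirical joint distribution becomes sparse and the estimate unreliable, which is what motivates the pairwise/triplet and Gaussian approximations being compared against.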

[1]  F. Fleuret. Fast Binary Feature Selection with Conditional Mutual Information, 2004, J. Mach. Learn. Res.

[2]  Ron Kohavi, et al. Wrappers for Feature Subset Selection, 1997, Artif. Intell.

[3]  Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[4]  Honglak Lee, et al. An Analysis of Single-Layer Networks in Unsupervised Feature Learning, 2011, AISTATS.

[5]  Jacek M. Zurada, et al. Normalized Mutual Information Feature Selection, 2009, IEEE Transactions on Neural Networks.

[6]  R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[7]  Wentian Li. Mutual information functions versus correlation functions, 1990.

[8]  Huan Liu, et al. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution, 2003, ICML.

[9]  Andrew Y. Ng, et al. The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization, 2011, ICML.

[10]  Isabelle Guyon, et al. An Introduction to Variable and Feature Selection, 2003, J. Mach. Learn. Res.

[11]  Chong-Ho Choi, et al. Input Feature Selection by Mutual Information Based on Parzen Window, 2002, IEEE Trans. Pattern Anal. Mach. Intell.

[12]  François Fleuret, et al. Jointly Informative Feature Selection, 2014, AISTATS.

[13]  David Page, et al. KDD Cup 2001 report, 2002, SIGKDD Explorations.

[14]  A. Kraskov, et al. Estimating mutual information, 2003, Physical Review E.

[15]  Anil K. Jain, et al. Feature Selection: Evaluation, Application, and Small Sample Performance, 1997, IEEE Trans. Pattern Anal. Mach. Intell.

[16]  Kari Torkkola, et al. Feature Extraction by Non-Parametric Mutual Information Maximization, 2003, J. Mach. Learn. Res.

[17]  G. A. Barnard, et al. Transmission of Information: A Statistical Theory of Communications, 1961.

[18]  Ralph Linsker, et al. Self-organization in a perceptual network, 1988, Computer.

[19]  Roberto Battiti, et al. Using mutual information for selecting features in supervised neural net learning, 1994, IEEE Trans. Neural Networks.

[20]  L. A. Smith, et al. Feature Subset Selection: A Correlation Based Filter Approach, 1997, ICONIP.