Using mutual information for selecting features in supervised neural net learning

This paper investigates the application of the mutual information criterion to evaluate a set of candidate features and to select an informative subset to be used as input data for a neural network classifier. Because the mutual information measures arbitrary dependencies between random variables, it is suitable for assessing the "information content" of features in complex classification tasks, where methods based on linear relations (such as the correlation) are prone to mistakes. Because the mutual information is independent of the coordinates chosen, it also permits a robust estimation. Nonetheless, using the mutual information for tasks with high input dimensionality requires suitable approximations, because of the prohibitive demands on computation and on the number of samples. An algorithm is proposed that is based on a "greedy" selection of the features and that takes into account both the mutual information with respect to the output class and the mutual information with respect to the already-selected features. Finally, the results of a series of experiments are discussed.
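The greedy selection described above can be sketched in a few lines of code. The following is a minimal illustration, not the paper's exact procedure: it assumes features are discretized by simple equal-width binning, uses scikit-learn's `mutual_info_score` as the mutual information estimator, and scores each candidate feature by its mutual information with the class minus a weighted sum of its mutual information with the already-selected features; the redundancy weight `beta` and the bin count are illustrative parameters.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def greedy_mi_selection(X, y, k, beta=0.5, n_bins=10):
    """Greedy feature selection sketch.

    At each step, pick the candidate feature f maximizing
        I(f; y) - beta * sum_{s in selected} I(f; s),
    i.e. high relevance to the output class, low redundancy with the
    features already chosen.
    """
    # Discretize each column so mutual information can be estimated
    # from simple contingency tables (an illustrative choice).
    Xd = np.empty_like(X, dtype=int)
    for j in range(X.shape[1]):
        edges = np.histogram_bin_edges(X[:, j], bins=n_bins)
        Xd[:, j] = np.digitize(X[:, j], edges[1:-1])

    n_features = X.shape[1]
    # Mutual information of each candidate feature with the class.
    relevance = np.array([mutual_info_score(Xd[:, j], y) for j in range(n_features)])

    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < k:
        scores = []
        for j in remaining:
            # Penalize redundancy with the already-selected features.
            redundancy = sum(mutual_info_score(Xd[:, j], Xd[:, s]) for s in selected)
            scores.append(relevance[j] - beta * redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Example use: keep the 5 most informative, least redundant columns
# before training the classifier.
# idx = greedy_mi_selection(X_train, y_train, k=5)
# X_train_reduced = X_train[:, idx]
```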
