Nonparametric selection of input variables for connectionist learning

When many candidate input variables to a statistical model exist, removing unimportant inputs can improve the model's performance significantly. A new method for selecting input variables is proposed. The method combines three components: mutual information as a relevance measure, kernel density estimation for estimating the required probabilities, and forward selection as the search procedure over input variables. Analysis of mutual information shows that it is a natural measure of input variable relevance and that it is more general than expected conditional variance; under certain conditions the two measures order the relevance of input variable subsets identically, but these conditions do not hold in general. An unbiased approximation to mutual information exists, but it is unbiased only if the underlying probabilities are exact. Analysis of kernel density estimation shows that the accuracy of mutual information estimates depends directly on how densely the data set populates the input space. However, for a range of explored problems, the relative ordering of the mutual information estimates remains correct despite inaccuracies in the individual estimates. Analysis of forward selection explores the amount of data required to select a given number of relevant input variables: the required amount of data grows roughly exponentially with the number of relevant input variables considered, and the chance of forward selection ending in a local minimum is reduced by bootstrapping the data. Finally, the method is compared with two connectionist methods for input variable selection, Sensitivity Based Pruning and Automatic Relevance Determination. The new method outperforms both when the number of independent candidate input variables is large, although it requires the number of relevant input variables to be relatively small. These results are confirmed on a number of real-world prediction problems, including the prediction of energy consumption in a building, the prediction of heart rate in a patient with sleep apnea, and the prediction of wind force on a wind turbine.
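The abstract only names the three components; the sketch below is a minimal illustration of how they could fit together, not the thesis's implementation. It assumes SciPy's gaussian_kde for the kernel density estimates, a simple plug-in estimate of mutual information evaluated at the sample points, and greedy forward selection; the names mi_kde, forward_select, and n_select are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde


def mi_kde(x, y):
    """Plug-in estimate of I(X; Y) for two 1-D samples: average of
    log p(x, y) / (p(x) p(y)) over the data, with all densities
    obtained from Gaussian kernel density estimates."""
    xy = np.vstack([x, y])
    p_xy = gaussian_kde(xy)(xy)   # joint density at each sample point
    p_x = gaussian_kde(x)(x)      # marginal density of x
    p_y = gaussian_kde(y)(y)      # marginal density of y
    return np.mean(np.log(p_xy / (p_x * p_y)))


def forward_select(X, y, n_select):
    """Greedy forward selection: at each step add the candidate input
    whose inclusion maximizes the estimated mutual information between
    the selected subset and the target y."""
    selected = []
    remaining = list(range(X.shape[1]))
    p_y = gaussian_kde(y)(y)
    for _ in range(n_select):
        scores = []
        for j in remaining:
            cols = selected + [j]
            inputs = X[:, cols].T                 # shape (k, n)
            joint = np.vstack([inputs, y])        # shape (k + 1, n)
            p_joint = gaussian_kde(joint)(joint)
            p_inputs = gaussian_kde(inputs)(inputs)
            # plug-in MI between the candidate subset and the target
            scores.append(np.mean(np.log(p_joint / (p_inputs * p_y))))
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

The abstract also notes that bootstrapping the data reduces the chance of forward selection ending in a local minimum; in terms of this sketch, that would amount to running forward_select on several bootstrap resamples of (X, y) and keeping the variables that are selected most often.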
