Estimating the theoretical error rate for prediction

Prediction for very large data sets is typically carried out in two stages: variable selection and pattern recognition. Ordinarily, variable selection involves seeing how well individual explanatory variables are correlated with the dependent variable. This practice neglects possible interactions among the variables. Simulations have shown that a statistic I, which we have used for variable selection, is much better correlated with predictivity than significance levels are. We explain this by defining theoretical predictivity and showing how I is related to it. We calculate the biases of the overoptimistic training estimate of predictivity and of the pessimistic out-of-sample estimate. Corrections for these biases lead to improved estimates of the potential predictivity of small groups of possibly interacting variables. These results support the use of I in the variable selection phase of prediction for data sets, such as those in genome-wide association studies (GWAS), where there are very many explanatory variables and modest sample sizes. Reference is made to another publication in which the use of I reduced the error rate of prediction from 30% to 8% for a data set with 4,918 variables and 97 subjects. This data set had previously been studied by scientists for over 10 years.