A study of pre-validation

Given a predictor of outcome derived from a high-dimensional dataset, pre-validation is a useful technique for comparing it to competing predictors on the same dataset. For microarray data, it allows one to compare a newly derived predictor of disease outcome to standard clinical predictors on the same dataset. We study pre-validation analytically to determine whether the inferences drawn from it are valid. We show that while pre-validation generally works well, the straightforward "one degree of freedom" analytical test from pre-validation can be biased, and we propose a permutation test to remedy this problem. In simulation studies, we show that the permutation test has the nominal level and achieves roughly the same power as the analytical test.
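
To make the idea concrete, the following is a minimal sketch of pre-validation followed by a permutation test, written under several assumptions that do not come from the text above: a continuous outcome, a toy internal predictor (the `n_top` features most associated with the outcome, fitted by least squares), a single clinical covariate `c`, and a permutation scheme that shuffles the rows of the expression matrix relative to the outcome and clinical covariate. All function names are hypothetical, and the permutation scheme is one plausible choice rather than necessarily the one studied here.

```python
import numpy as np

def prevalidated_predictor(X, y, n_folds=5, n_top=5, fold_seed=0):
    """Build a pre-validated predictor: each sample's value comes from a
    model fitted without that sample's fold (toy rule, an assumption)."""
    rng = np.random.default_rng(fold_seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    z = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        # Feature selection uses the training folds only.
        Xc = X[train] - X[train].mean(axis=0)
        yc = y[train] - y[train].mean()
        top = np.argsort(np.abs(Xc.T @ yc))[-n_top:]
        A = np.column_stack([np.ones(train.size), X[train][:, top]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        B = np.column_stack([np.ones(held_out.size), X[held_out][:, top]])
        z[held_out] = B @ beta
    return z

def t_stat_last(design, y):
    """Ordinary least-squares t-statistic for the last design column."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = design.shape[0] - design.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(design.T @ design)
    return beta[-1] / np.sqrt(cov[-1, -1])

def permutation_p_value(X, y, c, n_perm=99, seed=0, n_folds=5, n_top=5):
    """Permutation p-value for the pre-validated predictor, adjusting for c.
    Rows of X are shuffled relative to (y, c); pre-validation is redone
    from scratch for every permutation (assumed scheme)."""
    rng = np.random.default_rng(seed)

    def stat(X_mat):
        z = prevalidated_predictor(X_mat, y, n_folds=n_folds, n_top=n_top)
        design = np.column_stack([np.ones(len(y)), c, z])
        return abs(t_stat_last(design, y))

    observed = stat(X)
    exceed = sum(stat(X[rng.permutation(len(y))]) >= observed
                 for _ in range(n_perm))
    return (1 + exceed) / (n_perm + 1)
```

The analytical "one degree of freedom" test corresponds to referring the statistic from `t_stat_last` directly to its usual reference distribution. The permutation version instead compares it with statistics recomputed on shuffled data, so any bias introduced by the pre-validation step is present in the null distribution as well.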
