Survival analysis with high-dimensional covariates

In recent years, breakthroughs in biomedical technology have produced a wealth of data in which the number of features (for instance, genes for which expression measurements are available) exceeds the number of observations (e.g., patients). Sometimes survival outcomes are also available for those same observations. In this case, one might be interested in (a) identifying features that are associated with survival (in a univariate sense), and (b) developing a multivariate model for the relationship between the features and survival that can be used to predict survival for a new observation. Because of the high dimensionality of these data, most classical statistical methods for survival analysis cannot be applied directly. Here, we review a number of methods from the literature that address these two problems.
