Survival analysis on rare events using group-regularized multi-response Cox regression

MOTIVATION: The prediction performance of the Cox proportional hazards model suffers when the training data contain only a few uncensored events.

RESULTS: We propose a sparse-group regularized Cox regression method to improve prediction performance on large-scale, high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare-event response. This scenario is common in the UK Biobank dataset (Sudlow et al., 2015), where records for a large number of common and less prevalent diseases are available for the same set of individuals. By analyzing these responses together, we hope to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by Qian et al. (2020).

AVAILABILITY: https://github.com/rivas-lab/multisnpnet-Cox.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
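As a rough illustration of the kind of update an accelerated proximal gradient method applies under a sparse-group penalty, the sketch below computes the proximal operator for one predictor's coefficients across all responses. It assumes a penalty of the form lam1 * ||b||_1 + lam2 * ||b||_2 per predictor, which the abstract does not spell out; the function names, the parameters `step`, `lam1` and `lam2`, and the grouping by predictor are illustrative assumptions, not the authors' implementation.

```python
import math


def soft_threshold(x, t):
    """Shrink a scalar toward zero by t (the lasso proximal step)."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def prox_sparse_group(beta_row, step, lam1, lam2):
    """Proximal operator of step * (lam1 * ||b||_1 + lam2 * ||b||_2).

    beta_row holds one predictor's coefficients, one entry per survival
    response (an assumed grouping). The operator is the composition of
    elementwise soft-thresholding and group-wise shrinkage.
    """
    # Step 1: elementwise soft-thresholding gives within-group sparsity.
    shrunk = [soft_threshold(b, step * lam1) for b in beta_row]

    # Step 2: shrink the whole group; zero it out if its norm is small.
    norm = math.sqrt(sum(b * b for b in shrunk))
    if norm <= step * lam2:
        return [0.0] * len(beta_row)
    scale = 1.0 - step * lam2 / norm
    return [scale * b for b in shrunk]


if __name__ == "__main__":
    # A predictor with a weak effect on every response is dropped entirely.
    print(prox_sparse_group([0.5, -0.2], step=1.0, lam1=1.0, lam2=0.5))
    # A predictor with strong effects is kept but shrunk.
    print(prox_sparse_group([3.0, -4.0], step=1.0, lam1=0.0, lam2=1.0))
```

Under this assumed penalty, the group term either removes a predictor from all responses at once or keeps it for all of them, which is how a response with many events could lend support to a rare-event response that shares its predictors.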

[1] M. Rivas, et al. Medical relevance of protein-truncating variants across 337,205 individuals in the UK Biobank study, 2018, Nature Communications.

[2] Trevor Hastie, et al. Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank, 2020, Biostatistics.

[3] Marc Teboulle, et al. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems, 2009, SIAM J. Imaging Sci.

[4] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[5] Noah Simon, et al. A Sparse-Group Lasso, 2013.

[6] Genetics of 35 blood and urine biomarkers in the UK Biobank, 2020, Nature Genetics.

[7] I. Daubechies, et al. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, 2003, math/0307152.

[8] Emily K. Tsang, et al. Effect of predicted protein-truncating genetic variants on the human transcriptome, 2015, Science.

[9] Carson C. Chow, et al. Second-generation PLINK: rising to the challenge of larger and richer datasets, 2014, GigaScience.

[10] Stephen Weston, et al. Scalable Strategies for Computing with Massive Data, 2013.

[11] F. Harrell, et al. Evaluating the yield of medical tests, 1982, JAMA.

[12] Christopher M. DeBoever, et al. Pervasive additive and non-additive effects within the HLA region contribute to disease risk in the UK Biobank, 2020, bioRxiv.

[13] D. R. Cox. Regression Models and Life-Tables, 1972.

[14] M. Yuan, et al. Model selection and estimation in regression with grouped variables, 2006.

[15] M. Rivas, et al. Phenome-wide Burden of Copy Number Variation in the UK Biobank, 2019, American Journal of Human Genetics.

[16] R. Tibshirani, et al. Strong rules for discarding predictors in lasso-type problems, 2010, Journal of the Royal Statistical Society, Series B (Statistical Methodology).

[17] P. Elliott, et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age, 2015, PLoS Medicine.

[18] R. Tibshirani, et al. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank, 2019, bioRxiv.