Survival analysis on rare events using group-regularized multi-response Cox regression

We propose a sparse-group regularized Cox regression method to improve prediction performance on large-scale, high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare-event response. This scenario is common in the UK Biobank (Sudlow et al. 2015) dataset, where records of many common and rare diseases are available for the same set of individuals. By analyzing these responses jointly, we aim to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by Qian et al. (2019). We provide a software implementation of the proposed method and demonstrate its efficacy through simulations and applications to UK Biobank data.
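
As a rough illustration of the kind of step a proximal gradient method for a sparse-group penalty relies on, the sketch below computes the proximal operator of lam_l1*||b||_1 + lam_l2*||b||_2 for a single group b, following the standard sparse-group lasso result (soft-threshold each entry, then shrink the whole group). It is not the authors' implementation: the function names, the parameter names lam_l1 and lam_l2, and the reading of a "group" as one predictor's coefficients across the survival responses are assumptions made for illustration only.

```python
import math


def soft_threshold(x, t):
    """Elementwise soft-thresholding: sign(v) * max(|v| - t, 0)."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in x]


def prox_sparse_group(x, lam_l1, lam_l2):
    """Proximal operator of lam_l1*||b||_1 + lam_l2*||b||_2 for one group.

    Assumption for illustration: x holds the coefficients of a single
    predictor across all responses, treated as one group.
    """
    # Step 1: the L1 part zeroes out small individual entries.
    s = soft_threshold(x, lam_l1)
    # Step 2: the L2 part shrinks the group, or removes it entirely.
    norm = math.sqrt(sum(v * v for v in s))
    if norm <= lam_l2:
        return [0.0] * len(x)
    scale = 1.0 - lam_l2 / norm
    return [scale * v for v in s]
```

For example, `prox_sparse_group([3.0, -0.5], 1.0, 1.0)` first zeroes the small second entry and then halves the remaining one, while a group whose entries are all small is set to zero as a whole. That joint behavior is what lets a predictor be dropped for every response at once.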

[1]  M. Rivas, et al.  Medical relevance of protein-truncating variants across 337,205 individuals in the UK Biobank study, 2018, Nature Communications.

[2]  F. Harrell, et al.  Evaluating the yield of medical tests, 1982, JAMA.

[3]  Noah Simon, et al.  A Sparse-Group Lasso, 2013.

[4]  Trevor Hastie, et al.  Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank, 2020, Biostatistics.

[5]  Y. Nesterov.  A method for solving the convex programming problem with convergence rate O(1/k^2), 1983.

[6]  Emily K. Tsang, et al.  Effect of predicted protein-truncating genetic variants on the human transcriptome, 2015, Science.

[7]  Saman Khoramian, et al.  An iterative thresholding algorithm for linear inverse problems with multi-constraints and its applications, 2019, arXiv:1912.09285.

[8]  Marc Teboulle, et al.  A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems, 2009, SIAM J. Imaging Sci.

[9]  V. Bansal, et al.  Genome-wide association study results for educational attainment aid in identifying genetic heterogeneity of schizophrenia, 2018, Nature Communications.

[10]  R. Tibshirani, et al.  Strong rules for discarding predictors in lasso-type problems, 2010, Journal of the Royal Statistical Society. Series B, Statistical methodology.

[11]  Trevor Hastie, et al.  Fast Lasso method for Large-scale and Ultrahigh-dimensional Cox Model with applications to UK Biobank, 2020, bioRxiv.

[12]  Esprit study groups.  Development and validation of a risk score for chronic kidney disease in HIV infection using prospective cohort data from the D:A:D study, 2015.

[13]  R. Tibshirani.  Regression Shrinkage and Selection via the Lasso, 1996.

[14]  Stephen Weston, et al.  Scalable Strategies for Computing with Massive Data, 2013.

[15]  Trevor Hastie, et al.  A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank, 2019, bioRxiv.

[16]  P. Elliott, et al.  UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age, 2015, PLoS Medicine.

[17]  D. R. Cox.  Regression Models and Life-Tables, 1972.

[18]  Carson C. Chow, et al.  Second-generation PLINK: rising to the challenge of larger and richer datasets, 2014, GigaScience.

[19]  I. Daubechies, et al.  An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, 2003, arXiv:math/0307152.

[20]  M. Yuan, et al.  Model selection and estimation in regression with grouped variables, 2006.