Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank.

We develop a scalable and highly efficient algorithm to fit a Cox proportional hazards model by maximizing the $L^1$-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do not fit in memory. The output of our algorithm is the full Lasso path, that is, the parameter estimates at all predefined regularization parameters, together with their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype-survival time dataset across 306 disease outcomes from the UK Biobank (Sudlow and others, 2015). We provide a publicly available implementation of the proposed approach for genetics data on top of the PLINK2 package and name it snpnet-Cox.
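To make the two quantities named in the abstract concrete, the following is a minimal in-memory sketch of an $L^1$-penalized Cox objective and of the C-index. It is an illustration only and is not the snpnet-Cox implementation: the function names, the division of the partial likelihood by the sample size, and the assumption of no tied event times are all choices made here for simplicity. It states the problem as the equivalent minimization of the penalized negative log partial likelihood.

```python
import numpy as np

def cox_neg_log_partial_likelihood(beta, X, time, event):
    """Negative log partial likelihood divided by n.

    Assumes no tied event times (hypothetical simplification).
    """
    eta = X @ beta                      # linear predictor (log relative risk)
    order = np.argsort(-time)           # sort subjects by decreasing time
    eta_s, event_s = eta[order], event[order]
    # Running log-sum-exp: entry i covers all subjects with time >= time of i,
    # which is the risk set of subject i when there are no ties.
    log_risk = np.logaddexp.accumulate(eta_s)
    return -np.sum((eta_s - log_risk)[event_s == 1]) / len(time)

def lasso_objective(beta, X, time, event, lam):
    """Penalized objective to be minimized: loss plus lam times the L1 norm."""
    return cox_neg_log_partial_likelihood(beta, X, time, event) + lam * np.abs(beta).sum()

def concordance_index(eta, time, event):
    """Harrell-style C-index: the fraction of comparable pairs in which the
    subject who fails earlier has the higher risk score (score ties count 1/2)."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if event[i] != 1:               # a pair is anchored at an observed event
            continue
        for j in range(n):
            if time[j] > time[i]:       # j is known to outlive i
                comparable += 1
                if eta[i] > eta[j]:
                    concordant += 1.0
                elif eta[i] == eta[j]:
                    concordant += 0.5
    return concordant / comparable
```

In this sketch, evaluating `lasso_objective` at a fitted `beta` for each value of `lam` and then `concordance_index` on held-out data mirrors, on a toy scale, the path-plus-validation output described above. The quadratic-time C-index loop is meant for clarity, not for datasets of biobank size.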

[1]  N. Breslow. Covariance analysis of censored survival data. Biometrics, 1974.

[2]  F. Harrell, et al. Evaluating the yield of medical tests. JAMA, 1982.

[3]  W. Barlow, et al. Residuals for relative risk regression. 1988.

[4]  R. Tibshirani. Regression Shrinkage and Selection via the Lasso. 1996.

[5]  R. Tukey, et al. Human UDP-glucuronosyltransferases: metabolism, expression, and disease. Annual review of pharmacology and toxicology, 2000.

[6]  P. Bosma. Inherited disorders of bilirubin metabolism. Journal of hepatology, 2003.

[7]  R. Terkeltaub. Clinical practice. Gout. The New England journal of medicine, 2003.

[8]  R. McNamara, et al. Management of Atrial Fibrillation: Review of the Evidence for the Role of Pharmacologic Therapy, Electrical Cardioversion, and Echocardiography. Annals of Internal Medicine, 2003.

[9]  Mee Young Park, et al. L1‐regularization path algorithm for generalized linear models. 2007.

[10]  A. Hofman, et al. Association of three genetic loci with uric acid concentration and risk of gout: a genome-wide association study. The Lancet, 2008.

[11]  D. Postma, et al. Sequence variants affecting eosinophil numbers associate with asthma and myocardial infarction. Nature Genetics, 2009.

[12]  Insuk Sohn, et al. Gradient lasso for Cox proportional hazards model. Bioinform., 2009.

[13]  Trevor Hastie, et al. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of statistical software, 2010.

[14]  J. Goeman. L1 Penalized Estimation in the Cox Proportional Hazards Model. Biometrical journal. Biometrische Zeitschrift, 2009.

[15]  R. Tibshirani, et al. Strong rules for discarding predictors in lasso‐type problems. Journal of the Royal Statistical Society. Series B, Statistical methodology, 2010.

[16]  Stephen Weston, et al. Scalable Strategies for Computing with Massive Data. 2013.

[17]  C. Morrison, et al. Hormonal Contraception and the Risk of HIV Acquisition: An Individual Participant Data Meta-analysis. PLoS medicine, 2015.

[18]  Jonathan Taylor, et al. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 2015.

[19]  P. Elliott, et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLoS medicine, 2015.

[20]  D. Gudbjartsson, et al. A rare IL33 loss-of-function mutation reduces blood eosinophil counts and protects from asthma. PLoS genetics, 2017.

[21]  J. Kelsen, et al. The role of monogenic disease in children with very early onset inflammatory bowel disease. Current opinion in pediatrics, 2017.

[22]  P. Donnelly, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature, 2018.

[23]  Daniel Lemire, et al. Faster Population Counts Using AVX2 Instructions. Comput. J., 2016.

[24]  Trevor Hastie, et al. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. bioRxiv, 2019.

[25]  M. Rivas, et al. Phenome-wide Burden of Copy Number Variation in the UK Biobank. American journal of human genetics, 2019.

[26]  Robert Tibshirani, et al. On the Use of C-index for Stratified and Cross-Validated Cox Model. arXiv:1911.09638, 2019.

[27]  Christopher M. DeBoever, et al. Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics. bioRxiv, 2018.

[28]  Trevor Hastie, et al. A Fast and Flexible Algorithm for Solving the Lasso in Large-scale and Ultrahigh-dimensional Problems. 2019.

[29]  D. R. Cox. Regression Models and Life-Tables. 1972.