Fast Numerical Optimization for Genome Sequencing Data in Population Biobanks

We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. Each entry of a genetic matrix takes a value in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry, which reduces the memory requirement by a factor of 32 compared to a double-precision floating-point representation (2 bits versus 64 bits per entry). Using this representation, we implement an iteratively reweighted least squares algorithm to solve Lasso regressions on genetic matrices, which we name snpnet-2.0. When the dataset contains many rare variants, the predictors can be encoded in a sparse matrix. We exploit the sparsity of the predictor matrix to further reduce both the memory requirement and the computation time. Our sparse genetic matrix implementation combines the compact 2-bit representation with a simplified version of the compressed sparse block format, so that matrix-vector multiplications can be parallelized effectively across multiple CPU cores. To demonstrate the effectiveness of this representation, we implement an accelerated proximal gradient method to solve group Lasso problems on these sparse genetic matrices. This solver is named sparse-snpnet and will also be included in the snpnet R package. Our implementation solves group Lasso problems on sparse genetic matrices with more than 1,000,000 columns and almost 100,000 rows within 10 minutes, using less than 32 GB of memory.
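The 2-bit encoding described above can be illustrated with a minimal sketch. It is not the snpnet implementation: the code assignment (3 for NA), the function names `pack_genotypes`, `unpack_genotypes`, and `packed_dot`, and the choice to treat NA as 0 in the product are all assumptions made for illustration. The sketch packs four genotypes into each byte and checks that the packed array is 32 times smaller than a float64 array of the same length.

```python
import numpy as np

# Hypothetical code assignment: 0, 1, 2 are genotype values; 3 marks NA.
NA_CODE = 3


def pack_genotypes(g):
    """Pack a 1-D array of codes in {0, 1, 2, 3} into bytes, four codes per byte."""
    g = np.asarray(g, dtype=np.uint8)
    if g.size and g.max() > 3:
        raise ValueError("codes must lie in {0, 1, 2, 3}")
    pad = (-g.size) % 4  # pad with zeros up to a multiple of four
    quads = np.concatenate([g, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    # Entry k of each group of four occupies bits 2k and 2k + 1 of the byte.
    packed = quads[:, 0] | (quads[:, 1] << 2) | (quads[:, 2] << 4) | (quads[:, 3] << 6)
    return packed.astype(np.uint8)


def unpack_genotypes(packed, n):
    """Recover the first n 2-bit codes from a packed byte array."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 3
    return codes.reshape(-1)[:n]


def packed_dot(packed, n, v):
    """Inner product of one packed column with v, treating NA as 0 (an assumption)."""
    codes = unpack_genotypes(packed, n)
    values = np.where(codes == NA_CODE, 0.0, codes.astype(np.float64))
    return float(values @ np.asarray(v, dtype=np.float64))


if __name__ == "__main__":
    column = np.array([0, 1, 2, NA_CODE, 2, 1], dtype=np.uint8)
    packed = pack_genotypes(column)
    print(unpack_genotypes(packed, column.size))
    print(packed_dot(packed, column.size, np.ones(column.size)))
```

A production solver would operate on the packed bytes directly instead of unpacking a full column, and would typically replace missing entries by an imputed value such as the column mean; this sketch only shows the storage layout and the resulting memory ratio.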
