A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank

The UK Biobank (Bycroft et al., 2018) is a very large, prospective, population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso (Tibshirani, 1996) has, since its first proposal in statistics, proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension of the UK Biobank pose new challenges for applying the lasso, as many existing algorithms and their implementations do not scale to such large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet (Friedman et al., 2010a) and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports the ℓ1-penalized linear model, logistic regression, and Cox model, and also extends to the elastic net with an ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve superior predictive performance on quantitative and qualitative traits including height, body mass index, asthma and high cholesterol.

Author Summary

With the advent and evolution of large-scale and comprehensive biobanks, unprecedented opportunities have arisen for researchers to further uncover the complex landscape of human genetics. One major direction of long-standing interest is the investigation of the relationships between genotypes and phenotypes. This includes, but is not limited to, the identification of genotypes that are significantly associated with the phenotypes, and the prediction of phenotypic values based on the genotypic information. Genome-wide association studies (GWAS) are a very powerful and widely used framework for the former task, having produced a number of very impactful discoveries. However, for the latter task, their performance is limited by their univariate nature. Multiple regression methods have been suggested to fill this gap, but challenges emerge as both the dimension and the size of datasets grow large. In this paper, we present a novel computational framework that enables us to efficiently solve the entire lasso or elastic-net solution path on large-scale and ultrahigh-dimensional data, and thereby perform simultaneous variable selection and prediction. Our approach can build on any existing lasso solver for small or moderate-sized problems, scale it up to a big-data solution, and incorporate other extensions easily. We provide a package, snpnet, that extends the glmnet package in R and is optimized for large phenotype-genotype data. On the UK Biobank, we observe improved prediction performance on height, body mass index (BMI), asthma and high cholesterol by the lasso over other univariate and multiple regression methods. The scope of our approach also goes beyond genetic studies: it can be applied to general sparse regression problems and used to build scalable solutions for a variety of distribution families based on existing solvers.
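The idea of wrapping an existing lasso solver in a screen-fit-check loop can be illustrated with a small sketch. This is not the snpnet implementation and not necessarily the exact BASIL procedure: the function names (`lasso_cd`, `batch_screen_lasso`), the screening rule (add the inactive variables with the largest absolute gradient), the KKT-style check, and the parameters `batch_size` and `kkt_tol` are all assumptions made for illustration, and a plain NumPy coordinate-descent routine stands in for "any existing lasso solver". It handles a single penalty value rather than a full solution path.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, beta0=None, n_iter=1000, tol=1e-10):
    """Stand-in small-scale solver (hypothetical helper): coordinate descent
    for (1/2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else np.array(beta0, dtype=float)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Partial-residual correlation for coordinate j.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid -= delta * X[:, j]
                beta[j] = new
                max_delta = max(max_delta, abs(delta))
        if max_delta < tol:
            break
    return beta


def batch_screen_lasso(X, y, lam, batch_size=10, kkt_tol=1e-6):
    """Illustrative screen-fit-check loop (an assumption, not the paper's
    exact algorithm): fit on a small active set, then verify the optimality
    condition |x_j' r| / n <= lam on every variable left out."""
    n, p = X.shape
    beta = np.zeros(p)
    active = np.zeros(p, dtype=bool)
    for _ in range(p + 1):
        # Full pass over all variables; for data larger than memory this is
        # the step that would be streamed from disk in chunks.
        grad = np.abs(X.T @ (y - X @ beta)) / n
        violators = (~active) & (grad > lam + kkt_tol)
        if not violators.any():
            return beta  # optimality check passed on the left-out variables
        # Screening: admit the batch of violators with the largest gradient.
        idx = np.flatnonzero(violators)
        top = idx[np.argsort(grad[idx])[::-1][:batch_size]]
        active[top] = True
        cols = np.flatnonzero(active)
        # Fit only on the active columns, warm-started from the last fit.
        beta[cols] = lasso_cd(X[:, cols], y, lam, beta0=beta[cols])
    return beta
```

On a small synthetic problem, `batch_screen_lasso(X, y, lam)` should agree with `lasso_cd(X, y, lam)` run on all columns, while each inner fit touches only the screened subset. The design point is that the expensive full-data pass is limited to computing gradients, and the solver itself never sees more than the active set.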

[1]  H. D. Patterson,et al.  Recovery of inter-block information when block sizes are unequal , 1971 .

[2]  D. Cox Regression Models and Life-Tables , 1972 .

[3]  H. Wold Soft Modelling by Latent Variables: The Non-Linear Iterative Partial Least Squares (NIPALS) Approach , 1975, Journal of Applied Probability.

[4]  E. DeLong,et al.  Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. , 1988, Biometrics.

[5]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[6]  Jennifer R. Harris,et al.  Heritability of adult body height: a comparative study of twin cohorts in eight countries. , 2003, Twin research : the official journal of the International Society for Twin Studies.

[7]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.

[8]  Mehryar Mohri,et al.  Confidence Intervals for the Area Under the ROC Curve , 2004, NIPS.

[9]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[10]  D. Reich,et al.  Population Structure and Eigenanalysis , 2006, PLoS genetics.

[11]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[12]  P. Visscher,et al.  Bias, precision and heritability of self-reported and clinically measured height in Australian twins , 2006, Human Genetics.

[13]  Jianqing Fan,et al.  Sure independence screening for ultrahigh dimensional feature space , 2006, math/0612857.

[14]  Manuel A. R. Ferreira,et al.  Assumption-Free Estimation of Heritability from Genome-Wide Identity-by-Descent Sharing between Full Siblings , 2006, PLoS genetics.

[15]  D. Reich,et al.  Principal components analysis corrects for stratification in genome-wide association studies , 2006, Nature Genetics.

[16]  Peng Zhao,et al.  On Model Selection Consistency of Lasso , 2006, J. Mach. Learn. Res..

[17]  Nicolai Meinshausen,et al.  Relaxed Lasso , 2007, Comput. Stat. Data Anal..

[18]  C. Robert Discussion of "Sure independence screening for ultra-high dimensional feature space" by Fan and Lv. , 2008 .

[19]  Sanjay Ghemawat,et al.  MapReduce: simplified data processing on large clusters , 2008, CACM.

[20]  W. G. Hill,et al.  Heritability in the genomics era — concepts and misconceptions , 2008, Nature Reviews Genetics.

[21]  Lin Xiao,et al.  Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization , 2009, J. Mach. Learn. Res..

[22]  P. Bickel,et al.  SIMULTANEOUS ANALYSIS OF LASSO AND DANTZIG SELECTOR , 2008, 0801.1095.

[23]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition , 2001, Springer Series in Statistics.

[24]  Trevor J. Hastie,et al.  Genome-wide association analysis by lasso penalized logistic regression , 2009, Bioinform..

[25]  Trevor Hastie,et al.  Regularization Paths for Generalized Linear Models via Coordinate Descent. , 2010, Journal of statistical software.

[26]  Henrik,et al.  Association analyses of 249,796 individuals reveal eighteen new loci associated with body mass index , 2012 .

[27]  Ayellet V. Segrè,et al.  Hundreds of variants clustered in genomic loci and biological pathways affect human height , 2010, Nature.

[28]  P. Visscher,et al.  From Galton to GWAS: quantitative genetics of human height. , 2010, Genetics research.

[29]  Laurent El Ghaoui,et al.  Safe Feature Elimination for the LASSO and Sparse Supervised Learning Problems , 2010, 1009.4219.

[30]  P. Visscher,et al.  Common SNPs explain a large proportion of heritability for human height , 2011 .

[31]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[32]  P. Visscher,et al.  GCTA: a tool for genome-wide complex trait analysis. , 2011, American journal of human genetics.

[33]  P. Visscher,et al.  Estimating missing heritability for disease from genome-wide association studies. , 2011, American journal of human genetics.

[34]  Jian Huang,et al.  COORDINATE DESCENT ALGORITHMS FOR NONCONVEX PENALIZED REGRESSION, WITH APPLICATIONS TO BIOLOGICAL FEATURE SELECTION. , 2011, The annals of applied statistics.

[35]  Martin J. Wainwright,et al.  Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling , 2010, IEEE Transactions on Automatic Control.

[36]  R. Tibshirani,et al.  Strong rules for discarding predictors in lasso‐type problems , 2010, Journal of the Royal Statistical Society. Series B, Statistical methodology.

[37]  N. Patterson,et al.  Using Extended Genealogy to Estimate Components of Heritability for 23 Quantitative and Dichotomous Traits , 2013, PLoS genetics.

[38]  P. Visscher,et al.  Inference of the genetic architecture underlying BMI and height with the use of 20,240 sibling pairs. , 2013, American journal of human genetics.

[39]  Stephen Weston,et al.  Scalable Strategies for Computing with Massive Data , 2013 .

[40]  Ross M. Fraser,et al.  Defining the role of common variation in the genomic and biological architecture of adult human height , 2014, Nature Genetics.

[41]  Stephen P. Boyd,et al.  Proximal Algorithms , 2013, Found. Trends Optim..

[42]  P. Visscher,et al.  Nature Genetics Advance Online Publication , 2022 .

[43]  Jie Wang,et al.  Lasso screening rules via dual polytope projection , 2012, J. Mach. Learn. Res..

[44]  P. Visscher,et al.  Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index , 2015, Nature Genetics.

[45]  Carson C Chow,et al.  Second-generation PLINK: rising to the challenge of larger and richer datasets , 2014, GigaScience.

[46]  Ross M. Fraser,et al.  Genetic studies of body mass index yield new insights for obesity biology , 2015, Nature.

[47]  Trevor Hastie,et al.  Computer Age Statistical Inference: Algorithms, Evidence, and Data Science , 2016 .

[48]  B. Neale,et al.  Phenome-wide Heritability Analysis of the UK Biobank , 2016, bioRxiv.

[49]  P. Visscher,et al.  10 Years of GWAS Discovery: Biology, Function, and Translation. , 2017, American journal of human genetics.

[50]  Marcelo P. Segura-Lepe,et al.  Rare and low-frequency coding variants alter human adult height , 2016, Nature.

[51]  Andrey Ziyatdinov,et al.  Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr , 2018, Bioinform..

[52]  M. Rivas,et al.  Medical relevance of protein-truncating variants across 337,205 individuals in the UK Biobank study , 2018, Nature Communications.

[53]  Luke R. Lloyd-Jones,et al.  Signatures of negative selection in the genetic architecture of human complex traits , 2018, Nature Genetics.

[54]  Tian Ge,et al.  Polygenic Prediction via Bayesian Regression and Continuous Shrinkage Priors , 2018 .

[55]  Louis Lello,et al.  Accurate Genomic Prediction of Human Height , 2017, Genetics.

[56]  P. Donnelly,et al.  The UK Biobank resource with deep phenotyping and genomic data , 2018, Nature.

[57]  Christopher M. DeBoever,et al.  Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight novel adipocyte biology , 2018, bioRxiv.

[58]  Stephen D. Turner,et al.  qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots , 2014, bioRxiv.

[59]  Peter Z. G. Qian,et al.  Fast Penalized Regression and Cross Validation for Tall Data with the oem Package , 2018, J. Stat. Softw..

[60]  Naomi R. Wray,et al.  Improved polygenic prediction by Bayesian multiple regression on summary statistics , 2019, Nature Communications.

[61]  Christopher M. DeBoever,et al.  Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology , 2019, Nature Communications.

[62]  Yaohui Zeng,et al.  The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R , 2017, R J..

[63]  Trevor Hastie,et al.  Fast Lasso method for Large-scale and Ultrahigh-dimensional Cox Model with applications to UK Biobank , 2020, bioRxiv.