Significant sparse polygenic risk scores across 813 traits in UK Biobank

We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 × 10⁻⁵) incremental predictive performance relative to a covariate-only model that includes age, sex, genotyping array type, and the principal component loadings of genotypes. The number of genetic variants selected in a sparse PRS model correlates significantly with its incremental predictive performance (Spearman's ρ = 0.61, p = 2.2 × 10⁻⁵⁹ for quantitative traits; ρ = 0.21, p = 9.6 × 10⁻⁴ for binary traits). Sparse PRS models trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).
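
As a minimal sketch of the two quantities named above, the snippet below computes an incremental performance value as the difference between a full model's metric (covariates plus PRS) and the covariate-only model's metric, and then the Spearman rank correlation between model size and that increment. The function names, the choice of a simple difference, and the five toy traits are illustrative assumptions, not the study's pipeline or data; the metric might be R² for quantitative traits or AUC for binary traits.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def incremental_performance(metric_full, metric_covariates):
    """Gain of the full model (covariates + PRS) over the covariate-only model."""
    return metric_full - metric_covariates


# Hypothetical toy values for five traits (made up for illustration only).
n_variants = [12, 150, 980, 4300, 21000]      # variants selected per sparse model
r2_covariates = [0.02, 0.05, 0.01, 0.10, 0.03]  # covariate-only model
r2_full = [0.03, 0.08, 0.05, 0.12, 0.25]        # covariates + PRS

increments = [
    incremental_performance(full, cov)
    for full, cov in zip(r2_full, r2_covariates)
]
rho = spearman(n_variants, increments)
print(f"Spearman's rho = {rho:.2f}")
```

With these toy numbers the ranks of the increments are 1, 3, 4, 2, 5 against model-size ranks 1 to 5, so the rank correlation is 0.7. A significance test for ρ is omitted here.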
