GWAS in a Box: Statistical and Visual Analytics of Structured Associations via GenAMap

As genotyping and molecular phenotyping technologies continue to improve and typing costs fall, more and more clinical studies of complex diseases are expected to recruit thousands of individuals for pan-omic genetic association analyses within a few years. There is therefore a great need for algorithms and software tools that scale to the whole-omic level, integrate different omic data, leverage rich structural information, and are easily accessible to non-technical users. We present GenAMap, an interactive analytics software platform that 1) automates the execution of principled machine learning methods that detect genome- and phenome-wide associations among genotypes, gene expression data, and clinical or other macroscopic traits, and 2) provides new visualization tools specifically designed to aid in the exploration of association mapping results. Algorithmically, GenAMap is based on a new paradigm for GWAS and PheWAS analysis, termed structured association mapping, which leverages various structures in the omic data. We demonstrate GenAMap's functionality via a case study of the Brem and Kruglyak yeast dataset, then apply it to a comprehensive eQTL analysis of the NIH heterogeneous stock mice dataset and report findings from that analysis. GenAMap is available from http://sailing.cs.cmu.edu/genamap.
