HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data

Identity by descent (IBD) can be reliably detected for long shared DNA segments, which are found in related individuals. However, many studies contain cohorts of unrelated individuals that share only short IBD segments. New sequencing technologies facilitate identification of short IBD segments through rare variants, which convey more information on IBD than common variants. Current IBD detection methods, however, are not designed to use rare variants for the detection of short IBD segments. Short IBD segments reveal genetic structures at high resolution. Therefore, they can help to improve imputation and phasing, to increase genotyping accuracy for low-coverage sequencing and to increase the power of association studies. Since short IBD segments are further assumed to be old, they can shed light on the evolutionary history of humans. We propose HapFABIA, a computational method that applies biclustering to identify very short IBD segments characterized by rare variants. HapFABIA is designed to detect short IBD segments in genotype data that were obtained from next-generation sequencing, but can also be applied to DNA microarray data. Especially in next-generation sequencing data, HapFABIA exploits rare variants for IBD detection. HapFABIA significantly outperformed competing algorithms at detecting short IBD segments on artificial and simulated data with rare variants. HapFABIA identified 160 588 different short IBD segments characterized by rare variants with a median length of 23 kb (mean 24 kb) in data for chromosome 1 of the 1000 Genomes Project. These short IBD segments contain 752 000 single nucleotide variants (SNVs), which account for 39% of the rare variants and 23.5% of all variants. The vast majority—152 000 IBD segments—are shared by Africans, while only 19 000 and 11 000 are shared by Europeans and Asians, respectively. 
IBD segments that match the Denisova or the Neandertal genome are found significantly more often in Asians and Europeans but also, in some cases exclusively, in Africans. The lengths of IBD segments and their sharing between continental populations indicate that many short IBD segments from chromosome 1 existed before humans migrated out of Africa. Thus, rare variants that tag these short IBD segments predate human migration from Africa. The software package HapFABIA is available from Bioconductor. All data sets, result files and programs for data simulation, preprocessing and evaluation are supplied at http://www.bioinf.jku.at/research/short-IBD.
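The idea that rare variants tag short IBD segments can be illustrated with a minimal sketch. This is not the HapFABIA algorithm, which applies biclustering across many individuals at once. It is a simplified pairwise scan, and every name, threshold and data value in it (`rare_sites`, `shared_rare_runs`, `max_freq`, `min_shared`, `max_gap`, the toy haplotypes) is a hypothetical choice made for illustration. Two haplotypes are reported as sharing a candidate segment when they carry the same rare minor alleles at several nearby positions.

```python
from itertools import combinations

# Toy data (hypothetical): 10 haplotypes x 8 variant sites, 1 = minor allele.
POSITIONS = [500, 1000, 3000, 6000, 40000, 41000, 80000, 90000]
HAPLOTYPES = [
    [1, 1, 1, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]


def rare_sites(haplotypes, max_freq):
    """Return indices of sites whose minor-allele frequency is in (0, max_freq]."""
    n = len(haplotypes)
    sites = []
    for j in range(len(haplotypes[0])):
        carriers = sum(h[j] for h in haplotypes)
        if 0 < carriers / n <= max_freq:
            sites.append(j)
    return sites


def shared_rare_runs(haplotypes, positions, max_freq=0.2, min_shared=3, max_gap=10_000):
    """Pairwise scan: runs of shared rare minor alleles that lie close together.

    Returns tuples (hap_a, hap_b, start_pos, end_pos, n_shared_variants).
    """
    rare = rare_sites(haplotypes, max_freq)
    segments = []

    def flush(a, b, run):
        # Keep a run only if it is tagged by enough shared rare variants.
        if len(run) >= min_shared:
            segments.append((a, b, positions[run[0]], positions[run[-1]], len(run)))

    for a, b in combinations(range(len(haplotypes)), 2):
        shared = [j for j in rare if haplotypes[a][j] == 1 and haplotypes[b][j] == 1]
        run = []
        for j in shared:
            # A large physical gap ends the current candidate segment.
            if run and positions[j] - positions[run[-1]] > max_gap:
                flush(a, b, run)
                run = []
            run.append(j)
        flush(a, b, run)
    return segments


segments = shared_rare_runs(HAPLOTYPES, POSITIONS)
print(segments)  # [(0, 1, 1000, 6000, 3)]
```

In this toy example the common variant at position 500 is ignored, haplotypes 2 and 3 share only two rare variants and fall below `min_shared`, and the isolated shared variant at position 90000 does not extend the segment because it lies farther than `max_gap` from the previous one.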
