Paper information - HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data

Identity by descent (IBD) can be reliably detected for long shared DNA segments, which are found in related individuals. However, many studies contain cohorts of unrelated individuals that share only short IBD segments. New sequencing technologies facilitate identification of short IBD segments through rare variants, which convey more information on IBD than common variants. Current IBD detection methods, however, are not designed to use rare variants for the detection of short IBD segments. Short IBD segments reveal genetic structures at high resolution. Therefore, they can help to improve imputation and phasing, to increase genotyping accuracy for low-coverage sequencing and to increase the power of association studies. Since short IBD segments are further assumed to be old, they can shed light on the evolutionary history of humans. We propose HapFABIA, a computational method that applies biclustering to identify very short IBD segments characterized by rare variants. HapFABIA is designed to detect short IBD segments in genotype data that were obtained from next-generation sequencing, but can also be applied to DNA microarray data. Especially in next-generation sequencing data, HapFABIA exploits rare variants for IBD detection. HapFABIA significantly outperformed competing algorithms at detecting short IBD segments on artificial and simulated data with rare variants. HapFABIA identified 160 588 different short IBD segments characterized by rare variants with a median length of 23 kb (mean 24 kb) in data for chromosome 1 of the 1000 Genomes Project. These short IBD segments contain 752 000 single nucleotide variants (SNVs), which account for 39% of the rare variants and 23.5% of all variants. The vast majority—152 000 IBD segments—are shared by Africans, while only 19 000 and 11 000 are shared by Europeans and Asians, respectively. 
IBD segments that match the Denisova or the Neandertal genome are found significantly more often in Asians and Europeans but also, in some cases exclusively, in Africans. The lengths of IBD segments and their sharing between continental populations indicate that many short IBD segments from chromosome 1 existed before humans migrated out of Africa. Thus, rare variants that tag these short IBD segments predate human migration from Africa. The software package HapFABIA is available from Bioconductor. All data sets, result files and programs for data simulation, preprocessing and evaluation are supplied at http://www.bioinf.jku.at/research/short-IBD.
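
The idea that co-occurring rare variants tag a short shared segment can be illustrated with a toy example. The sketch below is not the HapFABIA algorithm (which uses biclustering) and does not use the Bioconductor package; it is a naive pairwise scan, written only to make the intuition concrete. The function name `rare_variant_sharing`, its parameters (`max_freq`, `window`, `min_shared`) and the toy data are all assumptions introduced for illustration.

```python
from collections import defaultdict
from itertools import combinations


def rare_variant_sharing(haplotypes, positions, max_freq=0.05,
                         window=25_000, min_shared=3):
    """Naive illustration (hypothetical helper, not HapFABIA's method).

    haplotypes: list of equal-length 0/1 lists, 1 = minor allele carried.
    positions:  base-pair position of each variant column.
    Returns (pair, start_bp, end_bp) for every pair of haplotypes that
    shares at least `min_shared` rare minor alleles within `window` bp.
    """
    n = len(haplotypes)
    shared = defaultdict(list)  # (i, j) -> positions of rare alleles both carry
    for col, pos in enumerate(positions):
        carriers = [i for i in range(n) if haplotypes[i][col] == 1]
        # Skip private variants (they cannot tag sharing) and common variants.
        if len(carriers) < 2 or len(carriers) / n > max_freq:
            continue
        for pair in combinations(carriers, 2):
            shared[pair].append(pos)

    segments = []
    for pair, hits in shared.items():
        hits.sort()
        start = 0
        for end in range(len(hits)):
            # Shrink the window until it spans at most `window` base pairs.
            while hits[end] - hits[start] > window:
                start += 1
            if end - start + 1 >= min_shared:
                # Report the first qualifying cluster for this pair.
                segments.append((pair, hits[start], hits[end]))
                break
    return segments


# Toy data (assumed): haplotypes 0 and 1 share three rare alleles within 14 kb.
toy_haplotypes = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
toy_positions = [1_000, 6_000, 9_000, 15_000, 90_000]
found = rare_variant_sharing(toy_haplotypes, toy_positions, max_freq=0.34)
print(found)
```

A pairwise scan like this grows quadratically with the number of carriers per variant and ignores genotyping errors, which is one reason a method for large sequencing data would need a different formulation, such as the biclustering approach the abstract describes.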

S. Hochreiter
