A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes

Many genomic data analyses, such as phasing, genotype imputation, and local ancestry inference, share a common core task: matching pairs of haplotypes at any position along the chromosome, thereby inferring a target haplotype as a succession of pieces from reference haplotypes, commonly called a mosaic of reference haplotypes. For that purpose, these analyses combine information provided by linkage disequilibrium, linkage and/or genealogy through a set of heuristic rules or, most often, through a hidden Markov model. Here, we develop an extremely randomized trees framework to address local haplotype matching. In our approach, a supervised classifier using extra-trees (a particular type of random forests) learns to identify the best local matches between haplotypes from a collection of observed examples. For each example, various features related to the different sources of information are observed, such as the length of a segment shared between haplotypes, or estimates of relationships between individuals, gametes, and haplotypes. The random forests framework was supplied with 30 features relevant to local haplotype matching. Repeated cross-validations allowed these features to be ranked by their importance for local haplotype matching. The distance to the edge of a segment shared by both haplotypes being matched was found to be the most important feature. Similarity comparisons between predicted and true whole-genome sequence haplotypes showed that the random forests framework was more efficient than a hidden Markov model in reconstructing a target haplotype as a mosaic of reference haplotypes. To further evaluate its efficiency, the random forests framework was applied to imputation of whole-genome sequence from 50k genotypes, and it yielded average reliabilities similar to or slightly better than those of IMPUTE2.
Through this exploratory study, we lay the foundations of a new framework to automatically learn local haplotype matching and we show that extra-trees are a promising approach for such purposes. The use of this new technique also reveals some useful lessons on the relevant features for the purpose of haplotype matching. We also discuss potential improvements for routine implementation.
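To make two of the segment-based features mentioned above concrete, the following minimal sketch computes, for every marker, the length of the segment shared by a target and a reference haplotype and the distance from that marker to the nearest edge of the segment. The function names (`shared_segment_features`, `naive_mosaic`), the string encoding of alleles, and the rule of picking the reference with the longest shared segment are illustrative assumptions; the study itself feeds such features (among 30) to an extra-trees classifier rather than applying this heuristic.

```python
from typing import List, Tuple


def shared_segment_features(target: str, reference: str) -> List[Tuple[int, int]]:
    """Return (segment_length, distance_to_nearest_edge) for each marker.

    A shared segment is a maximal run of consecutive markers at which both
    haplotypes carry the same allele. Markers where they differ get (0, 0).
    Alleles are encoded as characters, which is an assumption of this sketch.
    """
    if len(target) != len(reference):
        raise ValueError("haplotypes must cover the same markers")
    n = len(target)
    features: List[Tuple[int, int]] = [(0, 0)] * n
    start = 0
    while start < n:
        if target[start] != reference[start]:
            start += 1
            continue
        # Extend the run of identical alleles as far as possible.
        end = start
        while end + 1 < n and target[end + 1] == reference[end + 1]:
            end += 1
        length = end - start + 1
        for pos in range(start, end + 1):
            features[pos] = (length, min(pos - start, end - pos))
        start = end + 1
    return features


def naive_mosaic(target: str, references: List[str]) -> List[int]:
    """Pick, at each marker, the reference with the longest shared segment.

    Ties are broken by distance to the segment edge, then by reference order.
    This is a hypothetical heuristic for illustration, not the paper's method.
    """
    per_ref = [shared_segment_features(target, ref) for ref in references]
    return [
        max(range(len(references)), key=lambda r: per_ref[r][pos])
        for pos in range(len(target))
    ]


if __name__ == "__main__":
    target = "0110100"
    references = ["0111100", "0010100"]
    print(shared_segment_features(target, references[0]))
    print(naive_mosaic(target, references))
```

The returned list of reference indices is one simple representation of a mosaic: consecutive markers assigned to the same index form one piece copied from that reference haplotype.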
