SMaSH: a benchmarking toolkit for human genome variant calling

MOTIVATION: Computational methods are essential for extracting actionable information from raw sequencing data, and thus for fulfilling the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad hoc and incomplete. Agreement on how to benchmark variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers.

RESULTS: We propose SMaSH, a benchmarking methodology for evaluating germline variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on these benchmarking data. Moreover, we illustrate the utility of SMaSH by evaluating the performance of some leading single-nucleotide polymorphism, indel and structural variant calling algorithms.

AVAILABILITY AND IMPLEMENTATION: We provide free and open access online to the SMaSH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.
