Changepoint Analysis for Efficient Variant Calling

We present CAGe, a statistical algorithm that exploits high sequence identity between sampled genomes and a reference assembly to streamline the variant calling process. Using a combination of changepoint detection, classification, and online variant detection, CAGe calls simple variants quickly and accurately on the 90-95% of a sampled genome which differs little from the reference, while correctly learning the remaining 5-10% that must be processed using more computationally expensive methods. CAGe runs on a deeply sequenced human whole genome sample in approximately 20 minutes, potentially reducing the burden of variant calling by an order of magnitude after one memory-efficient pass over the data.
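To make the changepoint-detection step concrete, here is a minimal sketch of penalized optimal partitioning on a sequence of counts. It is not CAGe's actual procedure: the Poisson segment cost, the `penalty` value, the function names, and the idea of feeding it per-window counts (for example read depth or mismatch counts) are all assumptions made for illustration.

```python
import math


def poisson_cost(prefix, i, j):
    """Cost of modelling counts[i:j] with one Poisson rate.

    This is the negative maximized log-likelihood, dropping the
    sum(log(x!)) term, which is the same for every segmentation.
    """
    total = prefix[j] - prefix[i]
    if total == 0:
        return 0.0
    return total - total * math.log(total / (j - i))


def changepoints(counts, penalty):
    """Return sorted indices where a new segment starts (hypothetical helper).

    Plain O(n^2) dynamic program: best[j] is the minimum penalized cost
    of segmenting counts[:j]; each extra segment costs `penalty`.
    """
    n = len(counts)
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)

    best = [math.inf] * (n + 1)
    best[0] = -penalty  # so the first segment is not penalized
    last = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + poisson_cost(prefix, i, j) + penalty
            if cost < best[j]:
                best[j] = cost
                last[j] = i

    # Walk back through the stored segment starts.
    found = []
    j = n
    while j > 0:
        i = last[j]
        if i > 0:
            found.append(i)
        j = i
    return sorted(found)


# Toy signal (made-up data): the rate drops from 30 to 5 at index 20.
signal = [30] * 20 + [5] * 20
print(changepoints(signal, penalty=10.0))  # [20]
```

A larger `penalty` yields fewer, longer segments. In a pipeline of the kind the abstract describes, the resulting segments would then be passed to a classifier that decides which regions are simple enough for fast variant calling; that step is not shown here.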

Yun S. Song | Ameet Talwalkar | Michael I. Jordan | David A. Patterson | Bin Yu | Adam Bloniarz | Jonathan Terhorst
