A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control consortium

We present primary results from the Sequencing Quality Control (SEQC) project, coordinated by the US Food and Drug Administration. Examining Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites using reference RNA samples with built-in controls, we assess RNA sequencing (RNA-seq) performance for junction discovery and differential expression profiling and compare it to microarray and quantitative PCR (qPCR) data using complementary metrics. At all sequencing depths, we discover unannotated exon-exon junctions, with >80% validated by qPCR. We find that measurements of relative expression are accurate and reproducible across sites and platforms if specific filters are used. In contrast, RNA-seq and microarrays do not provide accurate absolute measurements, and gene-specific biases are observed for all examined platforms, including qPCR. Measurement performance depends on the platform and data analysis pipeline, and variation is large for transcript-level profiling. The complete SEQC data sets, comprising >100 billion reads (10Tb), provide unique resources for evaluating RNA-seq analyses for clinical and regulatory settings.

David P. Kreil | Todd M. Smith | Thomas M. Blomquist | Paweł P. Łabaj | Francisco J. Lopez | S. Hochreiter | May D. Wang | Wenwei Zhang | Meihua Gong | Yanyan Zhang | Simon M Lin | Djork-Arné Clevert | G. Schroth | P. Sykacek | C. Furlanello | C. Mason | Wei Wang | E. Thompson | S. Letovsky | Tieliu Shi | Yutaka Suzuki | Leming Shi | W. Jones | J. Willey | R. Setterquist | W. Tong | R. Jensen | Charles D. Johnson | J. Thierry-Mieg | Charles Wang | W. Bao | T. Chu | H. Fang | J. Fuscoe | W. Ge | Lei Guo | H. Hong | Quan-Zhen Li | N. Mei | B. Ning | R. Perkins | F. Qian | F. Staedtler | Z. Su | D. Thierry-Mieg | S. Walker | R. Wolfinger | J. Hadfield | S. Lababidi | Susanna-Assunta Sansone | E. Stupka | O. Stegle | P. Rocca-Serra | W. Xiao | Min Jian | Sheng Li | W. Shi | John F. Thompson | Weihong Xu | R. Kelly | Joshua Xu | A. Conesa | Hanlin Gao | N. Jafari | Yang Liao | Fei Lu | E. Oakeley | Zhiyu Peng | C. Praul | Javier Santoyo-Lopez | A. Scherer | G. Smyth | Xinzhen Tan | J. Vandesompele | Jian Wang | J. Zavadil | S. Auerbach | H. Binder | T. Blomquist | M. Brilliant | P. Bushel | Weimin Cai | J. Catalano | Ching-Wei Chang | Tao Chen | Geng Chen | Rong Chen | M. Chierici | Youping Deng | A. Derti | V. Devanarayan | Zirui Dong | J. Dopazo | T. Du | Yongxiang Fang | M. Fasold | Anita Fernandez | M. Fischer | P. Furió-Tarí | Florian Caimet | S. Gaj | Jorge A Gandara | Huan Gao | Y. Gondo | Binsheng Gong | Zhuolin Gong | B. Green | Chao Guo | Li Guo | J. Hellemans | Meiwen Jia | S. Kay | J. Kleinjans | S. Levy | Li Li | P. Li | Yan Li | Haiqing Li | Jianying Li | Shiyong Li | Xin-xin Lu | Heng Luo | Xiwen Ma | J. Meehan | D. Megherbi | Bing Mu | A. Pandey | Javier Perez-Florido | R. Peters | J. Phan | M. Pirooznia | T. Qing | L. Rainbow | Laure Sambourg | S. Schwartz | Ruchir R. Shah | Jie Shen | N. Stralis-Pavese | Lee Szkotnicki | M. Tinning | Bimeng Tu | J. V. Delft | Alicia Vela-Boza | E. Venturini | Liqing Wan | Jinhui Wang | Jun Wang | E. Wieben | P. Wu | J. Xuan | Yong Yang | Zhan Ye | Ye Yin | Ying Yu | Yate-Ching Yuan | John Zhang | Kecheng Zhang | Wenqian Zhang | Chen Zhao | Yuanting Zheng | Yiming Zhou | Paul Zumbo
