Measuring the Effect of Inter-Study Variability on Estimating Prediction Error

Background The biomarker discovery field is replete with molecular signatures that have not translated into the clinic despite ostensibly promising performance in predicting disease phenotypes. One widely cited reason is lack of classification consistency, largely due to failure to maintain performance from study to study. This failure is widely attributed to variability in data collected for the same phenotype among disparate studies, due to technical factors unrelated to phenotypes (e.g., laboratory settings resulting in “batch-effects”) and non-phenotype-associated biological variation in the underlying populations. These sources of variability persist in new data collection technologies.
Methods Here we quantify the impact of these combined “study-effects” on a disease signature’s predictive performance by comparing two types of validation methods: ordinary randomized cross-validation (RCV), which extracts random subsets of samples for testing, and inter-study validation (ISV), which excludes an entire study for testing. Whereas RCV hardwires an assumption of training and testing on identically distributed data, this key property is lost in ISV, yielding systematic decreases in performance estimates relative to RCV. Measuring the RCV-ISV difference as a function of the number of studies quantifies the influence of study-effects on performance.
Results As a case study, we gathered publicly available gene expression data from 1,470 microarray samples of 6 lung phenotypes from 26 independent experimental studies and 769 RNA-seq samples of 2 lung phenotypes from 4 independent studies. We find that the RCV-ISV performance discrepancy is greater in phenotypes with few studies, and that the ISV performance converges toward RCV performance as data from additional studies are incorporated into classification.
Conclusions We show that by examining how fast ISV performance approaches RCV as the number of studies is increased, one can estimate when “sufficient” diversity has been achieved for learning a molecular signature likely to translate without significant loss of accuracy to new clinical settings.
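
The contrast between RCV and ISV described in the Methods can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the simulated data (a class signal plus a random per-study offset standing in for a "study-effect"), the nearest-centroid classifier, and all function names are hypothetical. The only point is how the two splitting schemes differ and how the RCV-ISV gap can be tabulated against the number of studies.

```python
import random
from statistics import mean


def rcv_splits(n_samples, n_folds, rng):
    """Randomized CV: shuffle sample indices and deal them into folds."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        yield train, test


def isv_splits(study_ids):
    """Inter-study validation: hold out every sample of one study at a time."""
    for study in sorted(set(study_ids)):
        test = [i for i, s in enumerate(study_ids) if s == study]
        train = [i for i, s in enumerate(study_ids) if s != study]
        yield train, test


def nearest_centroid_accuracy(X, y, train, test):
    """Fit one centroid per class on train; return accuracy on test."""
    centroids = {}
    for label in set(y[i] for i in train):
        rows = [X[i] for i in train if y[i] == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def predict(x):
        # Assign the class whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return mean(1.0 if predict(X[i]) == y[i] else 0.0 for i in test)


def simulate(n_studies, per_study=20, n_genes=5, seed=0):
    """Toy data (assumed): class signal + per-study offset + noise."""
    rng = random.Random(seed)
    X, y, study_ids = [], [], []
    for s in range(n_studies):
        offset = [rng.gauss(0, 1.5) for _ in range(n_genes)]
        for _ in range(per_study):
            label = rng.randint(0, 1)
            X.append([label + offset[g] + rng.gauss(0, 1.0) for g in range(n_genes)])
            y.append(label)
            study_ids.append(s)
    return X, y, study_ids


def rcv_isv_gap(n_studies, seed=0):
    """Return (RCV accuracy, ISV accuracy) for a simulated data set."""
    X, y, study_ids = simulate(n_studies, seed=seed)
    rng = random.Random(seed + 1)
    # Use as many RCV folds as studies so both schemes hold out similar fractions.
    rcv = mean(
        nearest_centroid_accuracy(X, y, tr, te)
        for tr, te in rcv_splits(len(X), n_studies, rng)
    )
    isv = mean(
        nearest_centroid_accuracy(X, y, tr, te) for tr, te in isv_splits(study_ids)
    )
    return rcv, isv


if __name__ == "__main__":
    for k in (2, 4, 8, 16):
        rcv, isv = rcv_isv_gap(k)
        print(f"studies={k:2d}  RCV={rcv:.3f}  ISV={isv:.3f}  gap={rcv - isv:+.3f}")
```

In RCV every test fold mixes samples from all studies, so the classifier has already seen each study's offset during training. In ISV the held-out study's offset is unseen, which is the situation a signature faces in a new clinical setting. Whether the gap shrinks as studies are added in this toy setting depends on the assumed offset and noise scales, so the printed numbers should not be read as a reproduction of the reported results.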
