Gene set analysis methods: a systematic comparison

Background: Gene set analysis is a valuable tool to summarize high-dimensional gene expression data in terms of biologically relevant sets. This is an active area of research and numerous gene set analysis methods have been developed. Despite this popularity, systematic comparative studies have been limited in scope.

Methods: In this study we present a semi-synthetic simulation study using real datasets in order to test and compare commonly used methods.

Results: A software pipeline, Flexible Algorithm for Novel Gene set Simulation (FANGS), develops simulated data based on a prostate cancer dataset where the KRAS and TGF-β pathways were differentially expressed. The FANGS software is compatible with other datasets and pathways. Comparisons of gene set analysis methods are presented for Gene Set Enrichment Analysis (GSEA), Significance Analysis of Function and Expression (SAFE), sigPathway, and Correlation Adjusted Mean RAnk (CAMERA) methods. All gene set analysis methods are tested using gene sets from the MSigDB knowledge base. The false positive rate and power are estimated and presented for comparison. Recommendations are made for the utility of the default settings of methods and each method’s sensitivity towards various effect sizes.

Conclusions: The results of this study provide empirical guidance to users of gene set analysis methods. The FANGS software is available for researchers for continued methods comparisons.
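As a rough illustration of how a false positive rate and power can be estimated in a simulation study of this kind, the sketch below computes the fraction of gene-set p-values that fall at or below a significance threshold, separately for sets simulated with no effect and sets simulated with an effect. It is not the FANGS implementation: the function names, the 0.05 threshold, the use of "at or below" as the rejection rule, and the p-values themselves are all assumptions made up for the example.

```python
from typing import Sequence, Tuple


def rejection_rate(p_values: Sequence[float], alpha: float = 0.05) -> float:
    """Return the fraction of p-values at or below alpha."""
    if not p_values:
        raise ValueError("need at least one p-value")
    return sum(p <= alpha for p in p_values) / len(p_values)


def fpr_and_power(
    null_p: Sequence[float], alt_p: Sequence[float], alpha: float = 0.05
) -> Tuple[float, float]:
    """Estimate the false positive rate and the power.

    null_p: p-values for gene sets simulated with no differential expression.
    alt_p:  p-values for gene sets simulated with differential expression.
    """
    return rejection_rate(null_p, alpha), rejection_rate(alt_p, alpha)


# Hypothetical p-values, invented for illustration only (not study results).
null_p = [0.80, 0.03, 0.45, 0.61, 0.27, 0.92, 0.14, 0.50, 0.07, 0.33]
alt_p = [0.001, 0.20, 0.04, 0.01, 0.03, 0.30, 0.002, 0.049, 0.06, 0.015]

fpr, power = fpr_and_power(null_p, alt_p)
print(f"estimated false positive rate: {fpr:.2f}")
print(f"estimated power: {power:.2f}")
```

In practice the two lists would hold one p-value per simulated replicate from a single method, so the same calculation can be repeated per method and per effect size to build a comparison.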
