An empirical Bayes approach to inferring large-scale gene association networks

MOTIVATION: Genetic networks are often described statistically using graphical models (e.g. Bayesian networks). However, inferring the network structure poses a serious challenge in microarray analysis, where the sample size is small compared to the number of genes considered. This renders many standard algorithms for graphical models inapplicable and makes inferring genetic networks an 'ill-posed' inverse problem.

METHODS: We introduce a novel framework for small-sample inference of graphical models from gene expression data. Specifically, we focus on the so-called graphical Gaussian models (GGMs) that are now frequently used to describe gene association networks and to detect conditionally dependent genes. Our new approach is based on (1) improved (regularized) small-sample point estimates of partial correlation, (2) an exact test of edge inclusion with adaptive estimation of the degrees of freedom and (3) a heuristic network search based on false discovery rate multiple testing. Steps (2) and (3) correspond to an empirical Bayes estimate of the network topology.

RESULTS: Using computer simulations, we investigate the sensitivity (power) and specificity (true negative rate) of the proposed framework for estimating GGMs from microarray data. The simulations show that it is possible to recover the true network topology with high accuracy even for small-sample datasets. Subsequently, we analyze gene expression data from a breast cancer tumor study and illustrate our approach by inferring a corresponding large-scale gene association network for 3883 genes.
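The three steps listed under METHODS can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the regularized estimate in step (1) is replaced here by a simple ridge-type shrinkage of the correlation matrix toward the identity with a fixed, hypothetical intensity `lam`; the exact edge test in step (2) is replaced by a Fisher z normal approximation with a user-supplied degrees-of-freedom value `kappa` (the abstract describes estimating this adaptively); and step (3) uses the standard Benjamini-Hochberg step-up procedure. All function and parameter names are assumptions made for this example.

```python
import math
import numpy as np


def shrunken_partial_correlations(X, lam=0.2):
    """Step (1), simplified: partial correlations from a shrunken correlation matrix.

    X   : (n_samples, n_genes) expression matrix
    lam : shrinkage intensity in (0, 1]; a fixed illustrative value (assumption)
    """
    R = np.corrcoef(X, rowvar=False)            # sample correlation; singular when n < p
    R_shrunk = (1.0 - lam) * R + lam * np.eye(R.shape[0])  # pull toward the identity
    omega = np.linalg.inv(R_shrunk)             # precision (concentration) matrix
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)              # standardise and flip the sign
    np.fill_diagonal(pcor, 1.0)
    return pcor


def approx_p_values(r, kappa):
    """Step (2), simplified: two-sided p-values via Fisher's z (needs kappa > 2).

    This normal approximation stands in for the exact test; kappa is supplied
    by the caller rather than estimated from the data (assumption).
    """
    r = np.clip(np.asarray(r, dtype=float), -0.999999, 0.999999)
    z = np.arctanh(r) * math.sqrt(kappa - 2.0)
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])


def benjamini_hochberg(pvals, q=0.05):
    """Step (3): boolean mask of hypotheses rejected at FDR level q (step-up)."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])        # largest rank passing its threshold
        reject[order[:k + 1]] = True
    return reject
```

A possible usage, with simulated data in place of real expression values: draw `X = np.random.default_rng(0).normal(size=(20, 50))`, compute `pcor = shrunken_partial_correlations(X)`, take the upper triangle with `np.triu_indices(50, k=1)`, and pass the resulting p-values to `benjamini_hochberg` to obtain the set of edges retained in the network.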
