Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters

Background

Time course data from microarrays and high-throughput sequencing experiments require simple, computationally efficient and powerful statistical models to extract meaningful biological signal, and for tasks such as data fusion and clustering. Existing methodologies fail to capture either the temporal or replicated nature of the experiments, and often impose constraints on the data collection process, such as regularly spaced samples, or similar sampling schema across replications.

Results

We propose hierarchical Gaussian processes as a general model of gene expression time-series, with application to a variety of problems. In particular, we illustrate the method’s capacity for missing data imputation, data fusion and clustering. The method can impute data which is missing both systematically and at random: in a hold-out test on real data, performance is significantly better than commonly used imputation methods. The method’s ability to model inter- and intra-cluster variance leads to more biologically meaningful clusters. The approach removes the necessity for evenly spaced samples, an advantage illustrated on a developmental Drosophila dataset with irregular replications.

Conclusion

The hierarchical Gaussian process model provides an excellent statistical basis for several gene-expression time-series tasks. It has only a few additional parameters over a regular GP, has negligible additional complexity, is easily implemented and can be integrated into several existing algorithms. Our experiments were implemented in Python, and are available from the authors’ website: http://staffwww.dcs.shef.ac.uk/people/J.Hensman/.
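
The abstract does not spell out the covariance structure, so the following is only an illustrative sketch of one common way to build a two-level hierarchical GP, not the authors' implementation. It assumes a squared-exponential kernel at both levels: a gene-level kernel shared by all replicates, plus a replicate-level kernel that contributes only when two observations come from the same replicate. All function names, hyperparameter values and the toy data are hypothetical. The sketch shows why irregular sampling is unproblematic: each replicate simply brings its own vector of time points.

```python
import numpy as np

def rbf(t1, t2, variance, lengthscale):
    """Squared-exponential covariance between two vectors of time points."""
    d = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def hierarchical_cov(t1, r1, t2, r2, gene_par, rep_par):
    """Gene-level kernel plus a replicate-level kernel that is switched on
    only for pairs of points sharing a replicate label (assumed structure)."""
    same_rep = (r1[:, None] == r2[None, :]).astype(float)
    return rbf(t1, t2, *gene_par) + same_rep * rbf(t1, t2, *rep_par)

def predict_replicate(t, r, y, t_new, r_new, gene_par, rep_par, noise_var):
    """Posterior mean and variance of the noise-free replicate function."""
    K = hierarchical_cov(t, r, t, r, gene_par, rep_par)
    K = K + noise_var * np.eye(len(t))
    K_s = hierarchical_cov(t_new, r_new, t, r, gene_par, rep_par)
    K_ss = hierarchical_cov(t_new, r_new, t_new, r_new, gene_par, rep_par)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, K_s.T)
    mean = K_s @ alpha
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0)
    return mean, var

# Hypothetical toy data: two replicates sampled at different, uneven times.
t = np.array([0.0, 1.0, 2.5, 4.0, 0.5, 2.0, 3.0, 6.0])
r = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.sin(t) + np.where(r == 0, 0.1, -0.1)

# Hypothetical hyperparameters: (variance, lengthscale) for each level.
gene_par = (1.0, 2.0)
rep_par = (0.3, 2.0)

# Impute replicate 0 at times it was never sampled; replicate 1 informs
# the prediction through the shared gene-level kernel.
t_new = np.array([3.0, 5.0])
r_new = np.array([0, 0])
mean, var = predict_replicate(t, r, y, t_new, r_new, gene_par, rep_par, 0.05)
```

Under these assumptions the model adds only the replicate-level variance and lengthscale to a standard GP, in line with the abstract's remark that only a few additional parameters are needed. In practice the hyperparameters would be fitted, for example by maximising the marginal likelihood, rather than fixed by hand as here.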
