LOGOS: a modular Bayesian model for de novo motif detection

The complexity of the global organization and internal structures of motifs in higher eukaryotic organisms raises significant challenges for motif detection techniques. To achieve successful de novo motif detection it is necessary to model the complex dependencies within and among motifs and incorporate biological prior knowledge. In this paper, we present LOGOS, an integrated LOcal and GlObal motif Sequence model for biopolymer sequences, which provides a principled framework for developing, modularizing, extending and computing expressive motif models for complex biopolymer sequence analysis. LOGOS consists of two interacting submodels: HMDM, a local alignment model capturing biological prior knowledge and positional dependence within the motif local structure; and HMM, a global motif distribution model modeling frequencies and dependencies of motif occurrences. Model parameters can be fit using training motifs within an empirical Bayesian framework. A variational EM algorithm is developed for de novo motif detection. LOGOS improves over existing models that ignore biological priors and dependencies in motif structures and motif occurrences, and demonstrates superior performance on both semirealistic test data and cis-regulatory sequences from yeast and Drosophila sequences with regard to sensitivity, specificity, flexibility and extensibility.

[1]  T. D. Schneider,et al.  Sequence logos: a new way to display consensus sequences. , 1990, Nucleic acids research.

[2]  Jun S. Liu,et al.  Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies , 1995 .

[3]  Charles Elkan,et al.  Unsupervised learning of multiple motifs in biopolymers using expectation maximization , 1995, Mach. Learn..

[4]  David Haussler,et al.  Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology , 1996, Comput. Appl. Biosci..

[5]  B. Efron Empirical Bayes Methods for Combining Likelihoods , 1996 .

[6]  B. Efron,et al.  Empirical Bayes Methods for Combining Likelihoods: Comment , 1996 .

[7]  D. S. Fields,et al.  Specificity, free energy and information content in protein-DNA interactions. , 1998, Trends in biochemical sciences.

[8]  G. Church,et al.  Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation , 1998, Nature Biotechnology.

[9]  Gary D. Stormo,et al.  Identifying DNA and protein patterns with statistically significant alignments of multiple sequences , 1999, Bioinform..

[10]  H. Bussemaker,et al.  Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[11]  J. Collado-Vides,et al.  Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. , 2000, Nucleic acids research.

[12]  G. Church,et al.  Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. , 2000, Journal of molecular biology.

[13]  H. Bussemaker,et al.  Regulatory element detection using correlation with expression , 2001, Nature Genetics.

[14]  Gary D. Stormo,et al.  Identifying target sites for cooperatively binding factors , 2001, Bioinform..

[15]  E. Davidson Genomic Regulatory Systems , 2001 .

[16]  Martin C. Frith,et al.  Detection of cis -element clusters in higher eukaryotic DNA , 2001, Bioinform..

[17]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[18]  L. Pennacchio,et al.  Genomic strategies to identify mammalian regulatory sequences , 2001, Nature Reviews Genetics.

[19]  Michael I. Jordan,et al.  A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences , 2002, NIPS.

[20]  G. Rubin,et al.  Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[21]  Jun S. Liu,et al.  An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments , 2002, Nature Biotechnology.

[22]  B. Alberts,et al.  Molecular Biology of the Cell 4th edition , 2007 .

[23]  Nir Friedman,et al.  Modeling dependencies in protein-DNA binding sites , 2003, RECOMB '03.

[24]  Michael I. Jordan,et al.  A generalized mean field algorithm for variational inference in exponential families , 2002, UAI.

[25]  Jun S. Liu,et al.  Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model , 2003 .

[26]  Michael I. Jordan,et al.  An Introduction to Variational Methods for Graphical Models , 1999, Machine Learning.