Nonparametric Combinatorial Sequence Models

This work considers biological sequences that exhibit combinatorial structure in their composition: groups of positions in the aligned sequences are "linked" and covary as one unit across sequences. If multiple such groups exist, complex interactions can emerge between them. Sequences of this kind arise frequently in biology, but methodologies for analyzing them are still being developed. This paper presents a nonparametric prior on sequences that allows combinatorial structures to emerge and that induces a posterior distribution over factorized sequence representations. Experiments on three sequence datasets indicate that combinatorial structures are indeed present and that combinatorial sequence models describe them more succinctly than simpler mixture models. We conclude with an application to MHC binding prediction that highlights the utility of the posterior distribution induced by the prior. By integrating over the posterior, our method compares favorably to leading binding predictors.
