Kernel-Based Data Fusion and Its Application to Protein Function Prediction in Yeast

Kernel methods provide a principled framework for representing many types of data, including vectors, strings, trees, and graphs. As such, these methods are useful for drawing inferences about biological phenomena. We describe a method for combining multiple kernel representations optimally by formulating the combination as a convex optimization problem that can be solved with semidefinite programming techniques. The method is applied to the problem of predicting the functional classifications of yeast proteins using a support vector machine (SVM) trained on five types of data. On this problem, the new method performs better than a previously described Markov random field method, and better than an SVM trained on any single type of data.
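To make the idea of combining kernel representations concrete, the following is a minimal sketch, not the optimization described above. It builds two Gram matrices from two hypothetical data views of the same six samples, rescales each to a common trace, and forms a non-negative weighted sum. The weights here are fixed by hand purely for illustration, whereas the method in this paper chooses the combination by convex optimization. The function names, the toy data, the `gamma` value, and the trace normalization are all assumptions introduced for the example.

```python
import numpy as np

def linear_kernel(X):
    """Gram matrix of inner products between the rows of X."""
    return X @ X.T

def rbf_kernel(X, gamma=0.5):
    """Gaussian (RBF) Gram matrix; gamma is an illustrative choice."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative distances caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(d2, 0.0))

def trace_normalize(K):
    """Rescale K so that its trace equals the number of samples."""
    return K * (K.shape[0] / np.trace(K))

def combine_kernels(kernels, weights):
    """Weighted sum of Gram matrices with non-negative weights."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative")
    return sum(w * K for w, K in zip(weights, kernels))

# Two hypothetical views of the same 6 samples (e.g. two data types).
rng = np.random.default_rng(0)
X_a = rng.normal(size=(6, 3))
X_b = rng.normal(size=(6, 4))

kernels = [
    trace_normalize(linear_kernel(X_a)),
    trace_normalize(rbf_kernel(X_b)),
]
weights = [0.3, 0.7]  # hand-picked for illustration, not learned

K = combine_kernels(kernels, weights)

# A non-negative combination of positive semidefinite matrices is itself
# positive semidefinite, so K remains a valid kernel matrix.
min_eig = np.linalg.eigvalsh(K).min()
```

Because each input matrix is positive semidefinite and the weights are non-negative, the combined matrix can be passed to any kernel-based learner that accepts a precomputed Gram matrix, such as an SVM.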
