Classifying biological sequences is one of the most important tasks in computational biology. In the last decade, support vector machines (SVMs) in combination with sequence kernels have emerged as a de facto standard. These methods are theoretically well-founded, reliable, and provide high-accuracy solutions at low computational cost. In many practical situations, however, obtaining a highly accurate classifier is not the end of the story. Instead, one often aims to acquire biological knowledge about the principles underlying a given classification task. SVMs with traditional sequence kernels do not offer a straightforward way of accessing this knowledge.

In this contribution, we propose a new approach to analyzing biological sequences on the basis of support vector machines with sequence kernels. We first extract explicit pattern weights from a given SVM. When classifying a sequence, we then compute a prediction profile by distributing the weight of each pattern to the sequence positions that match the pattern. The final profile allows not only assessing the importance of a position, but also determining for which class it is indicative. Since it is infeasible to analyze the profiles of all sequences in a given data set, we advocate using affinity propagation (AP) clustering to narrow down the analysis to a small set of typical sequences.

The proposed approach is applicable to a wide range of biological sequences and a wide selection of sequence kernels. To illustrate our framework, we present the prediction of oligomerization tendencies of coiled coil proteins as a case study.
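
The two central steps, extracting pattern weights and spreading them over sequence positions, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an unnormalized spectrum kernel (contiguous k-mers), and it assumes that the support sequences and their signed dual coefficients (alpha_i * y_i) have already been obtained from a trained SVM. The function names `kmer_counts`, `pattern_weights`, and `prediction_profile`, as well as the toy sequences and coefficients, are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_counts(seq, k):
    """Count every contiguous k-mer in seq (spectrum-kernel feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def pattern_weights(support_seqs, dual_coefs, k):
    """Explicit weight of each k-mer: sum_i (alpha_i * y_i) * count_i(k-mer).

    dual_coefs are assumed to be the signed dual coefficients of the
    support sequences, in the same order as support_seqs.
    """
    weights = defaultdict(float)
    for seq, coef in zip(support_seqs, dual_coefs):
        for kmer, count in kmer_counts(seq, k).items():
            weights[kmer] += coef * count
    return dict(weights)


def prediction_profile(seq, weights, k):
    """Distribute each matching k-mer's weight evenly over its k positions."""
    profile = [0.0] * len(seq)
    for i in range(len(seq) - k + 1):
        w = weights.get(seq[i:i + k], 0.0)  # unseen patterns contribute nothing
        for j in range(i, i + k):
            profile[j] += w / k
    return profile


# Toy data (hypothetical): two support sequences with signed coefficients.
support_seqs = ["ABAB", "BBBB"]
dual_coefs = [1.0, -0.5]
weights = pattern_weights(support_seqs, dual_coefs, k=2)
profile = prediction_profile("ABB", weights, k=2)
```

Under these assumptions, the entries of `profile` sum to the SVM decision value without the bias term, so a positive entry marks a position that argues for the positive class and a negative entry one that argues for the negative class. Normalized or position-specific kernels would need a correspondingly adapted weight computation.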