Paper Information - Decoding Sequence Classification Models for Acquiring New Biological Insights

Decoding Sequence Classification Models for Acquiring New Biological Insights

Classifying biological sequences is one of the most important tasks in computational biology. In the last decade, support vector machines (SVMs) in combination with sequence kernels have emerged as a de-facto standard. These methods are theoretically well-founded, reliable, and provide high-accuracy solutions at low computational cost. However, obtaining a highly accurate classifier is rarely the end of the story in many practical situations. Instead, one often aims to acquire biological knowledge about the principles underlying a given classification task. SVMs with traditional sequence kernels do not offer a straightforward way of accessing this knowledge.

In this contribution, we propose a new approach to analyzing biological sequences on the basis of support vector machines with sequence kernels. We first extract explicit pattern weights from a given SVM. When classifying a sequence, we then compute a prediction profile by distributing the weight of each pattern to the sequence positions that match the pattern. The final profile not only allows assessing the importance of a position, but also determining for which class it is indicative. Since it is unfeasible to analyze profiles of all sequences in a given data set, we advocate using affinity propagation (AP) clustering to narrow down the analysis to a small set of typical sequences.

The proposed approach is applicable to a wide range of biological sequences and a wide selection of sequence kernels. To illustrate our framework, we present the prediction of oligomerization tendencies of coiled coil proteins as a case study.
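The two steps named in the abstract, extracting explicit pattern weights and distributing them over matching positions, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an unnormalized spectrum kernel (all length-k substrings as patterns), assumes the SVM is given as support sequences with signed dual coefficients (alpha_i * y_i), and assumes each pattern's weight is split evenly across the k positions it covers. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_counts(seq, k):
    """Spectrum-kernel feature map (assumed): count every length-k substring."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def pattern_weights(support_seqs, dual_coefs, k):
    """Explicit weight per k-mer: sum over support sequences of
    (alpha_i * y_i) * (number of occurrences of the k-mer in sequence i)."""
    weights = defaultdict(float)
    for seq, coef in zip(support_seqs, dual_coefs):
        for kmer, count in kmer_counts(seq, k).items():
            weights[kmer] += coef * count
    return dict(weights)


def prediction_profile(seq, weights, k):
    """Distribute the weight of each matching k-mer evenly (an assumption)
    over the k sequence positions it covers."""
    profile = [0.0] * len(seq)
    for start in range(len(seq) - k + 1):
        w = weights.get(seq[start:start + k], 0.0)
        for pos in range(start, start + k):
            profile[pos] += w / k
    return profile


# Toy example with made-up sequences and coefficients (not real data).
support_seqs = ["AAB", "BBA"]
dual_coefs = [1.0, -1.0]  # signed coefficients alpha_i * y_i
weights = pattern_weights(support_seqs, dual_coefs, k=2)
profile = prediction_profile("AABB", weights, k=2)
```

Under these assumptions, the entries of the profile sum to the SVM decision value without the bias term, so positive entries mark positions that push the prediction toward the positive class and negative entries toward the negative class.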

Ulrich Bodenhofer | Sepp Hochreiter | Andreas Kothmeier | Ingrid G. Abfalter | Carsten C. Mahrenholz

[1] Corinna Cortes, et al. Support-Vector Networks, 1995, Machine Learning.

[2] Ulrich Bodenhofer, et al. Modeling Position Specificity in Sequence Kernels by Fuzzy Equivalence Relations, 2009, IFSA/EUSFLAT Conf.

[3] Delbert Dueck, et al. Clustering by Passing Messages Between Data Points, 2007, Science.

[4] Eleazar Eskin, et al. The Spectrum Kernel: A String Kernel for SVM Protein Classification, 2001, Pacific Symposium on Biocomputing.

[5] Jason Weston, et al. Mismatch string kernels for discriminative protein classification, 2004, Bioinform.

[6] V. Pavlovic, et al. A fast, large-scale learning method for protein sequence classification, 2008.