Sequence Classification For Protein Analysis

To analyze the sequential data obtained from genome sequencing (DNA and the translated protein sequences), from EEG and ECG measurements, from environmental sensors (e.g. to detect earthquakes), or from sensors used in machine fault detection, sequences must be classified. These examples show that in biology, medicine, and control there is a great demand for sequence classification methods, both to evaluate recorded sequences and to understand the data generation process. Until now, however, machine learning techniques have failed to offer sufficient solutions for sequence classification. Classifying sequences by treating them as vectors is not feasible because the sequences have different lengths and the model complexity increases with sequence length, which in turn decreases the generalization capability. Recent approaches to sequence classification first extract features from the sequence and then apply classification algorithms to the feature vectors, e.g. kernel methods using kernels designed for sequences. However, the quality of these approaches rises and falls with the quality of the extracted features, i.e. the constructed kernels. If not enough prior knowledge about relevant sequence features is available, these feature extraction approaches are not reliable. In principle, the automatic extraction of appropriate sequence features can be realized by time series methods, i.e. by dynamical systems. However, sequence classification differs from traditional time series prediction because the system must supply an output, the class label, only at the sequence end. Here new problems arise. At each sequence position the classifier may receive important class information needed at the sequence end; it must therefore deal with long-term dependencies, which leads to the problem of the “vanishing gradient” [3, 1]. 
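The “vanishing gradient” can be made concrete with a minimal sketch: a hypothetical scalar tanh recurrence (an illustration, not the architecture used in this paper). The derivative of the final state with respect to the initial state is a product of per-step chain-rule factors; with a recurrent weight below one in magnitude, this product shrinks exponentially with the temporal distance:

```python
import numpy as np

# Illustrative scalar recurrent net: h_t = tanh(w * h_{t-1} + x_t).
# The gradient dh_T/dh_0 is the product of the per-step factors
# w * (1 - h_t^2); for |w| < 1 (non-chaotic regime) it decays
# exponentially with the number of steps T.

def gradient_through_time(w, xs, h0=0.0):
    """Return |dh_T/dh_0| for the scalar tanh recurrence above."""
    h, grad = h0, 1.0
    for x in xs:
        h = np.tanh(w * h + x)
        grad *= w * (1.0 - h * h)   # chain-rule factor for this step
    return abs(grad)

rng = np.random.default_rng(0)
xs = rng.normal(size=100)
g10 = gradient_through_time(0.9, xs[:10])    # short temporal distance
g100 = gradient_through_time(0.9, xs[:100])  # long temporal distance
print(g10, g100)  # g100 is many orders of magnitude smaller than g10
```

Since each factor has magnitude at most 0.9, the gradient over 100 steps is bounded by 0.9^100 ≈ 2.7e-5, regardless of the inputs: class information received early in the sequence contributes almost nothing to the gradient at the sequence end.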
The “vanishing gradient” refers to the characteristic of non-chaotic dynamical systems that the gradient of states with respect to previous states vanishes exponentially with the temporal distance between these states. This feature of non-chaotic systems results from the fact that initial conditions do not have a large influence on later states. Therefore non-chaotic systems are prevented from learning to store information over time. However, learning to store relevant information until the sequence end is essential for sequence classification. To avoid the “vanishing gradient” problem we have introduced the “Long Short-Term Memory” (LSTM, [3]) and now report its application to protein motif and fold recognition. A volume conserving mapping of LSTM’s central subarchitecture keeps information and avoids the “vanishing gradient”. Volume