Clustering and Visualization of Large Protein Sequence Databases by Means of an Extension on the Self-Organizing Map

New, more effective software tools are needed for the analysis and organization of the continually growing biological databases. In this work, an extension of the Self-Organizing Map (SOM) is used to cluster all 77,977 protein sequences of the SWISS-PROT database, release 37. Unlike some previous methods, this one does not convert the sequences into histogram vectors before clustering. Instead, it automatically finds a collection of true representative model sequences that approximate the contents of the database in a compact way. The model sequences are based on the concept of the generalized median of symbol strings and can be computed once the user has defined a suitable similarity measure for the sequences, such as Smith-Waterman, BLAST, or FASTA; the FASTA method is used in this work. The benefits of the SOM, and of its extension, are fast computation, an approximate representation of the large database by a much smaller, fixed number of model sequences, and easy interpretation of the clustering through visualization. The complete sequence database is mapped onto a two-dimensional graphic SOM display, and clusters of similar sequences are then made visible by shades of gray that indicate the degree of similarity between adjacent model sequences.
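The idea of a representative model sequence can be illustrated with a minimal sketch. It makes two simplifying assumptions that are not taken from the work itself: plain Levenshtein edit distance stands in for the FASTA similarity measure, and the set median (the member of the set with the smallest summed distance to all the others) stands in for the generalized median, which may be any string and need not belong to the set. The function names `edit_distance` and `set_median` are hypothetical.

```python
from typing import Callable, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; an assumed stand-in for a FASTA-based measure."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def set_median(strings: Sequence[str],
               dist: Callable[[str, str], int] = edit_distance) -> str:
    """Return the member of `strings` with the smallest summed distance
    to all members. This restricts the search to the set itself, so it
    only approximates the generalized median."""
    if not strings:
        raise ValueError("need at least one string")
    return min(strings, key=lambda c: sum(dist(c, s) for s in strings))


if __name__ == "__main__":
    # Toy sequences, made up for illustration only.
    cluster = ["ACDE", "ACDF", "ACDE", "WWWW"]
    print(set_median(cluster))
```

In a SOM of symbol strings, a model of this kind would be recomputed for each map node from the sequences assigned to that node and its neighbors; the sketch covers only the median step for a single set.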
