Identification of structurally conserved residues of proteins in absence of structural homologs using neural network ensemble

Motivation: Various bioinformatics and machine learning techniques have been applied to identify sequence-conserved and functionally conserved residues in proteins. Although a few computational methods are available for predicting structurally conserved residues from protein structure, almost all of them require homologous structural information and structure-based alignments, which remain a bottleneck in protein structure comparison studies. In this work, we developed a neural network approach for identifying structurally important residues from a single protein structure, without using homologous structural information or structural alignment.
Results: A neural network ensemble (NNE) method based on negative correlation learning (NCL) was developed to identify structurally conserved residues (SCRs) in proteins, using features that represent amino acid conservation and composition, physico-chemical properties and structural properties. The NCL-NNE method was applied to 6042 SCRs extracted from 496 protein domains. It achieved high prediction sensitivity (92.8%) and quality (Matthews correlation coefficient of 0.852) in identifying SCRs. Further benchmarking on 60 protein domains containing 1657 SCRs that were not part of the training and testing datasets shows that NCL-NNE can correctly predict SCRs with ∼90% sensitivity. These results suggest that NCL-NNE is useful for identifying SCRs from information derived from a single protein structure. The method could therefore be extremely effective in large-scale benchmarking studies where reliable structural homologs and alignments are limited.
Availability: The executable for the NCL-NNE algorithm is available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/SCR.htm
Contact: epnsugan@ntu.edu.sg; chakraba@ncbi.nlm.nih.gov
Supplementary information: Supplementary data are available at Bioinformatics online.
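
The two quality measures quoted in the Results, sensitivity and the Matthews correlation coefficient (MCC), follow from the per-residue confusion counts of a binary SCR / non-SCR prediction. The sketch below uses the standard definitions; the function names and the example counts are hypothetical illustrations and are not taken from the paper's datasets.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true SCRs that were predicted as SCRs: TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion counts.

    Returns 0.0 when any marginal total is zero (denominator undefined),
    which is a common convention rather than part of the definition.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not the paper's results).
    tp, tn, fp, fn = 90, 85, 15, 10
    print(f"sensitivity = {sensitivity(tp, fn):.3f}")
    print(f"MCC         = {matthews_cc(tp, tn, fp, fn):.3f}")
```

Unlike sensitivity, MCC uses all four confusion counts, so it stays informative when SCRs and non-SCRs are unevenly represented among the residues.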
