Identification of functionally diverse lipocalin proteins from sequence information using support vector machine

Lipocalins are functionally diverse proteins composed of 120–180 amino acid residues. Members of this family have several important biological functions, including ligand transport, cryptic coloration, sensory transduction, endonuclease activity, stress response in plants, odorant binding, prostaglandin biosynthesis, regulation of cellular homeostasis, immunity and immunotherapy. Identifying lipocalins from protein sequence is challenging because their pairwise sequence identity is poor and often falls below the twilight zone. So far, no specific method has been reported to identify lipocalins from primary sequence. In this paper, we report a support vector machine (SVM) approach, LipoPred, that predicts lipocalins from protein sequence using sequence-derived properties. LipoPred was trained on a dataset of 325 lipocalin proteins and 325 non-lipocalin proteins, and evaluated on an independent test set of 140 lipocalin proteins and 21,447 non-lipocalin proteins. On the training dataset, LipoPred achieved 88.61% accuracy with 89.26% sensitivity, 85.27% specificity and a Matthews correlation coefficient (MCC) of 0.74. On the independent test dataset, LipoPred achieved 84.25% accuracy with 88.57% sensitivity, 84.22% specificity and an MCC of 0.16. LipoPred performed better than the PSI-BLAST, HMM and SVM-Prot methods. Out of 218 lipocalins, LipoPred correctly predicted 194 proteins, including 39 lipocalins that are non-homologous to any protein in the SWISSPROT database. This result shows that LipoPred is potentially useful for predicting lipocalin proteins that have no sequence homologs in the sequence databases. Further, the successful prediction of nine hypothetical lipocalin proteins and five new members of the lipocalin family indicates that LipoPred can be used to identify and annotate new lipocalin proteins in sequence databases. The LipoPred software and dataset are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/lipopred.htm.
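The test-set figures above combine a high accuracy with a low MCC, which follows from the strong class imbalance (140 positives against 21,447 negatives). The following Python sketch shows how the four measures are computed from a confusion matrix. The counts are not reported in the abstract; they are an assumption, reconstructed by rounding the stated sensitivity and specificity against the stated class sizes, and the function name `binary_metrics` is purely illustrative.

```python
from math import sqrt

def binary_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity, MCC) for a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal total is zero; return 0.0 by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sensitivity, specificity, mcc

# Class sizes of the independent test set, as stated in the abstract.
n_pos, n_neg = 140, 21447

# ASSUMPTION: counts reconstructed from the reported rates, not taken from the paper.
tp = round(0.8857 * n_pos)   # correctly predicted lipocalins
tn = round(0.8422 * n_neg)   # correctly rejected non-lipocalins
fn = n_pos - tp
fp = n_neg - tn

accuracy, sensitivity, specificity, mcc = binary_metrics(tp, fn, tn, fp)
precision = tp / (tp + fp)   # fraction of predicted lipocalins that are real

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f} MCC={mcc:.2f} precision={precision:.4f}")
```

Under these reconstructed counts the sketch reproduces the reported accuracy, sensitivity, specificity and MCC to the stated precision. It also suggests why the MCC drops: with roughly 150 negatives per positive, a specificity of about 84% still yields far more false positives than true positives, so precision is only around 3.5%.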
