Binary classification of protein molecules into intrinsically disordered and ordered segments

Background: Although structural domains (SDs) in proteins are important, half of the regions in the human proteome currently have no SD assignment. These unassigned regions consist not only of novel SDs but also of intrinsically disordered (ID) regions, since proteins, especially those of eukaryotes, generally contain a significant fraction of ID regions. As ID regions can be inferred from amino acid sequences, a method that combines SD and ID region assignments can determine the fractions of SDs and ID regions in any proteome.

Results: In contrast to other available ID prediction programs, which merely identify likely ID regions, the DICHOT system we previously developed classifies the entire protein sequence into SDs and ID regions. Application of DICHOT to the human proteome revealed that, on a residue-wise basis, ID regions constitute 35%, SDs with similarity to PDB structures comprise 52%, and SDs with no similarity to PDB structures account for the remaining 13%. The last group consists of novel structural domains, termed cryptic domains, which are good targets for structural genomics. Applying DICHOT to the proteomes of other model organisms indicated that eukaryotes generally have high ID contents, while prokaryotes do not. In human proteins, ID content differs among subcellular localizations: nuclear proteins had the highest residue-wise ID fraction (47%), while mitochondrial proteins had the lowest (13%). Phosphorylation and O-linked glycosylation sites were found to be located preferentially in ID regions. As O-linked glycans are attached to residues in the extracellular regions of proteins, the modification is likely to protect the ID regions from proteolytic cleavage in the extracellular environment. Alternative splicing events tend to occur more frequently in ID regions. We interpret this as evidence that natural selection operates at the protein level in alternative splicing.

Conclusions: We classified entire regions of proteins into two categories, SDs and ID regions, and thereby obtained complete genome-wide statistics of various kinds. The results of the present study provide important basic information for understanding protein structural architectures and have been made publicly available at http://spock.genes.nig.ac.jp/~genome/DICHOT.
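The residue-wise percentages quoted above are fractions of residues, not of proteins or of segments. The following is a minimal sketch of that bookkeeping, not the DICHOT implementation: the label names, the `residue_fractions` function, and the segment format (1-based inclusive coordinates that tile each sequence) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical label names; the abstract describes the three categories only in prose.
LABELS = ("ID", "SD_PDB", "SD_cryptic")


def residue_fractions(proteins):
    """Return the residue-wise fraction of each label across all proteins.

    `proteins` maps a protein identifier to a list of (start, end, label)
    segments, with 1-based inclusive coordinates (an assumed format).
    """
    counts = Counter()
    for segments in proteins.values():
        for start, end, label in segments:
            if label not in LABELS:
                raise ValueError(f"unknown label: {label}")
            # Each segment contributes its length in residues.
            counts[label] += end - start + 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no residues to count")
    return {label: counts[label] / total for label in LABELS}


# Toy input (not real data): one 100-residue protein split into three segments.
toy = {
    "protA": [(1, 35, "ID"), (36, 87, "SD_PDB"), (88, 100, "SD_cryptic")],
}
fractions = residue_fractions(toy)
```

Because every residue belongs to exactly one segment, the three fractions sum to one, which is what allows a binary SD/ID classification to yield complete proteome-wide statistics.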
