Engineering proteinase K using machine learning and synthetic genes

Background

Altering a protein's function by changing its sequence allows natural proteins to be converted into useful molecular tools. Current protein engineering methods are limited by a lack of high throughput physical or computational tests that can accurately predict protein activity under conditions relevant to its final application. Here we describe a new synthetic biology approach to protein engineering that avoids these limitations by combining high throughput gene synthesis with machine learning-based design algorithms.

Results

We selected 24 amino acid substitutions to make in proteinase K from alignments of homologous sequences. We then designed and synthesized 59 specific proteinase K variants containing different combinations of the selected substitutions. The 59 variants were tested for their ability to hydrolyze a tetrapeptide substrate after the enzyme was first heated to 68°C for 5 minutes. Sequence and activity data were analyzed using machine learning algorithms. This analysis was used to design a new set of variants predicted to have increased activity over the training set, which were then synthesized and tested. By performing two cycles of machine learning analysis and variant design we obtained 20-fold improved proteinase K variants while testing a total of only 95 variant enzymes.

Conclusion

The number of protein variants that must be tested to obtain significant functional improvements determines the type of tests that can be performed. Protein engineers wishing to modify a property of a protein, for example to shrink tumours or to catalyze chemical reactions under industrial conditions, have until now been forced to accept high throughput surrogate screens that measure protein properties they hope will correlate with the functionalities they intend to modify. By reducing the number of variants that must be tested to fewer than 100, machine learning algorithms make it possible to use more complex and expensive tests, so that only protein properties directly relevant to the desired application need to be measured. Protein design algorithms that require the testing of only a small number of variants represent a significant step towards a generic, resource-optimized protein engineering process.
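
The design cycle described above (encode each variant by which substitutions it carries, fit a sequence-activity model, then propose new combinations predicted to be more active) can be illustrated with a minimal sketch. The abstract does not state which learning algorithm or design rule was used, so ridge regression and the "keep every substitution with a positive fitted weight" rule below are assumptions chosen for illustration. The activity values are synthetic and randomly generated, not measurements from the study, and all function and variable names are hypothetical.

```python
import random

N_SUBS = 24      # candidate substitutions (count taken from the abstract)
N_VARIANTS = 59  # first-round variants (count taken from the abstract)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ridge(X, y, lam=1.0):
    """Ridge regression on centered data; returns (weights, intercept)."""
    n, p = len(X), len(X[0])
    mx = [sum(row[j] for row in X) / n for j in range(p)]
    my = sum(y) / n
    Xc = [[row[j] - mx[j] for j in range(p)] for row in X]
    yc = [v - my for v in y]
    # Normal equations with an L2 penalty: (Xc^T Xc + lam * I) w = Xc^T yc
    A = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(p)]
    w = solve(A, b)
    intercept = my - sum(wi * mi for wi, mi in zip(w, mx))
    return w, intercept


def predict(w, intercept, x):
    """Predicted activity of a variant encoded as a 0/1 substitution vector."""
    return intercept + sum(wi * xi for wi, xi in zip(w, x))


# Synthetic stand-in for the training set (NOT data from the study).
rng = random.Random(0)
hidden_effects = [rng.gauss(0.0, 1.0) for _ in range(N_SUBS)]
X = [[1 if rng.random() < 0.25 else 0 for _ in range(N_SUBS)]
     for _ in range(N_VARIANTS)]
y = [sum(h * xi for h, xi in zip(hidden_effects, row)) + rng.gauss(0.0, 0.3)
     for row in X]

w, b0 = fit_ridge(X, y, lam=1.0)

# Naive design rule (an assumption): keep every substitution whose fitted
# weight is positive. Under a purely additive model this maximizes the
# predicted activity over all combinations.
proposal = [1 if wi > 0 else 0 for wi in w]
best_train = max(predict(w, b0, row) for row in X)

print("substitutions kept:", sum(proposal))
print("proposal predicted above best training variant:",
      predict(w, b0, proposal) >= best_train - 1e-9)
```

An additive model like this ignores interactions between substitutions, so in practice its proposals would need to be synthesized and tested, and the model refit on the enlarged data set, as in the two cycles the abstract describes.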
