A metabolome pipeline: from concept to data to knowledge

Metabolomics, like other omics methods, produces huge datasets of biological variables, often accompanied by the necessary metadata. However, regardless of the form in which these are produced, they are merely the raw material for helping us to answer biological questions. In this short tutorial review and position paper we seek to set out some of the elements of “best practice” in the optimal acquisition of such data, and in the means by which they may be turned into reliable knowledge. We do this in terms of a “pipeline” that goes from the design of good experiments, through instrumental optimization, data storage and manipulation, and the chemometric data processing methods in common use, to the means of validation and cross-validation needed to give conclusions that are credible and likely to be robust when applied in comparable circumstances to samples not used in their generation. Many of these steps involve the solution of what amount to combinatorial optimization problems, and methods developed for these, especially those based on evolutionary computing, are proving valuable.
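
As an illustration of what is meant by testing conclusions on samples not used in their generation, the following is a minimal sketch of k-fold cross-validation. It is not taken from the paper: the nearest-centroid classifier, the synthetic "control" and "treated" data, and all function names are assumptions chosen only to keep the example self-contained.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Return the mean vector of each class (a simple stand-in model)."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, row):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )


def cross_validate(X, y, k=5, seed=0):
    """Estimate accuracy using only samples held out from model fitting."""
    correct = 0
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # The model never sees the held-out samples during fitting.
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)


def make_toy_data(n_per_class=30, n_features=5, shift=3.0, seed=1):
    """Hypothetical data: two classes whose feature means differ by `shift`."""
    rng = random.Random(seed)
    X, y = [], []
    for label, offset in (("control", 0.0), ("treated", shift)):
        for _ in range(n_per_class):
            X.append([rng.gauss(offset, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(f"cross-validated accuracy: {cross_validate(X, y):.2f}")
```

The point of the design is that every sample is predicted exactly once, by a model fitted without it, so the reported accuracy reflects performance on unseen samples rather than on the data used to build the model.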
