Parallel EDAs to create multivariate calibration models for quantitative chemical applications

This paper describes the application of a collection of data mining methods to a calibration problem in quantitative chemistry. Experimental data obtained from reactions involving known concentrations of two or more components are used to calibrate a model that is later used to predict the unknown concentrations of those components in a new reaction. The problem can be viewed as a combined selection and prediction task: the goal is to predict the target variables accurately while minimizing the number of input variables, keeping only a small subset of the truly significant ones. The initial approaches were principal component analysis and filtering, combined with two prediction techniques: artificial neural networks and partial least squares regression. Finally, a parallel estimation of distribution algorithm was used to reduce the number of variables used for prediction, yielding the best models for all the problems considered.
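
As an illustration of the selection and prediction idea, the following is a minimal sketch of variable selection driven by an estimation of distribution algorithm. It is not the method used in the paper: it assumes the simplest EDA (a univariate model, often called UMDA), runs sequentially rather than in parallel, and substitutes ordinary least squares for the neural network and partial least squares predictors. The function name `umda_select`, its parameters, the per-variable penalty, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def umda_select(X, y, pop_size=60, n_best=20, generations=30,
                penalty=0.01, seed=0):
    """Search for a small subset of input variables with a univariate EDA.

    Assumption: ordinary least squares stands in for the predictor, and the
    fitness is hold-out RMSE plus a penalty per selected variable.
    """
    rng = np.random.default_rng(seed)
    n_vars = X.shape[1]

    # Simple hold-out split: first two thirds to fit, the rest to validate.
    n_train = X.shape[0] * 2 // 3
    X_tr, X_va = X[:n_train], X[n_train:]
    y_tr, y_va = y[:n_train], y[n_train:]

    def fitness(mask):
        # An empty subset cannot predict anything, so rank it last.
        if not mask.any():
            return np.inf
        coef, *_ = np.linalg.lstsq(X_tr[:, mask], y_tr, rcond=None)
        rmse = np.sqrt(np.mean((X_va[:, mask] @ coef - y_va) ** 2))
        return rmse + penalty * mask.sum()

    # Probability that each variable is included; start undecided.
    p = np.full(n_vars, 0.5)
    best_mask, best_fit = None, np.inf

    for _ in range(generations):
        # Sample a population of candidate subsets from the current model.
        pop = rng.random((pop_size, n_vars)) < p
        fits = np.array([fitness(ind) for ind in pop])
        order = np.argsort(fits)

        if fits[order[0]] < best_fit:
            best_fit = fits[order[0]]
            best_mask = pop[order[0]].copy()

        # Re-estimate the model from the best individuals; clipping keeps
        # every variable reachable in later generations.
        selected = pop[order[:n_best]]
        p = selected.mean(axis=0).clip(0.05, 0.95)

    return best_mask, best_fit


# Synthetic calibration-like data: 20 inputs, only 3 of them informative.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(120, 20))
y = 2.0 * X[:, 0] - X[:, 3] + 0.5 * X[:, 7] + 0.01 * data_rng.normal(size=120)

mask, score = umda_select(X, y)
print("selected variables:", np.flatnonzero(mask))
print("penalized hold-out error:", score)
```

In this sketch the penalty term plays the role of the requirement to minimize the number of input variables, and each fitness evaluation is independent of the others, which is the part of such an algorithm that lends itself most directly to parallel execution.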
