Active Learning with Support Vector Machines in the Drug Discovery Process

We investigate the following data mining problem from computer-aided drug design: given a large collection of compounds, find those that bind to a target molecule, using as few iterations of biochemical testing as possible. In each iteration, a comparatively small batch of compounds is screened for binding activity toward this target. We employ the active learning paradigm from machine learning to select the successive batches. Our main selection strategy is based on the maximum margin hyperplane generated by Support Vector Machines. This hyperplane separates the compounds currently labeled active from those labeled inactive and has the largest possible distance from any labeled compound. We perform a thorough comparative study against various other selection strategies on data sets provided by DuPont Pharmaceuticals and show that the strategies based on the maximum margin hyperplane clearly outperform the simpler ones.
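The batch selection step can be illustrated with a minimal sketch. It assumes a hyperplane `w·x + b = 0` has already been fitted to the labeled compounds by some SVM trainer, which is not shown here. The function names, the two strategy names, and the toy descriptor vectors are hypothetical; the abstract does not specify the exact selection rules, so "nearest to the hyperplane" and "farthest on the active side" are shown only as two plausible hyperplane-based choices.

```python
import math


def signed_distance(x, w, b):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def select_batch(unlabeled, w, b, batch_size, strategy="nearest"):
    """Return indices of the unlabeled compounds to screen next.

    strategy="nearest":          compounds closest to the hyperplane
                                 (the ones the classifier is least sure about).
    strategy="largest_positive": compounds farthest on the positive side
                                 (the ones predicted most strongly active).
    Both strategies are assumptions made for illustration.
    """
    scored = [(signed_distance(x, w, b), i) for i, x in enumerate(unlabeled)]
    if strategy == "nearest":
        scored.sort(key=lambda t: (abs(t[0]), t[1]))
    elif strategy == "largest_positive":
        scored.sort(key=lambda t: (-t[0], t[1]))
    else:
        raise ValueError("unknown strategy: " + strategy)
    return [i for _, i in scored[:batch_size]]


# Toy pool of four compounds described by two made-up features.
pool = [[2.0, 0.0], [0.1, 0.0], [-0.2, 0.0], [-3.0, 0.0]]
w, b = [1.0, 0.0], 0.0  # hypothetical hyperplane from a trained SVM

nearest = select_batch(pool, w, b, batch_size=2)
most_positive = select_batch(pool, w, b, batch_size=2, strategy="largest_positive")
```

In a full active learning loop, the selected batch would be screened, its labels added to the training set, and the hyperplane refitted before the next selection.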
