Support Vector Machines for Active Learning in the Drug Discovery Process

We investigate the following data mining problem from computer-aided drug design: from a large collection of compounds, find those that bind to a target molecule in as few iterations of biochemical testing as possible. In each iteration, a comparatively small batch of compounds is screened for binding activity towards this target. We employ the so-called “active learning paradigm” from Machine Learning to select the successive batches. Our main selection strategy is based on the maximum margin hyperplane generated by “Support Vector Machines”. This hyperplane separates the current set of active compounds from the inactive ones and has the largest possible distance from any labeled compound. We perform a thorough comparative study of various other selection strategies on data sets provided by DuPont Pharmaceuticals and show that the strategies based on the maximum margin hyperplane clearly outperform the simpler ones.

∗Part of this work has been presented at the QSAR Gordon Conference in Tilton, NH, USA (August 2001). An extended abstract that emphasizes the Machine Learning aspects of our work and compares a large number of selection strategies appeared in the proceedings of the NIPS 2001 conference [1].

†Computer Science Dept., University of California, Santa Cruz, CA 94065, USA

‡Corresponding author, Email: manfred@cse.ucsc.edu, Tel: +1 831 459 4950

§RSISE, Australian National University, ACT 0200, Canberra, Australia

¶Deltagen Research Labs, 740 Bay Road, Redwood City, CA 94063, USA

‖BioSolveIT GmbH, An der Ziegelei 75, 53757 Sankt Augustin, Germany
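The batch selection step described in the abstract can be illustrated with a minimal sketch. It assumes a hyperplane (`w`, `b`) has already been fitted to the labeled compounds by some SVM solver, and that each compound is a numeric descriptor vector. The abstract does not say which hyperplane-based rule is used, so the two rules shown here ("nearest" to the hyperplane, and "largest_positive" distance on the active side) are illustrative assumptions, as are the function names and the toy data.

```python
import math


def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def select_batch(w, b, unlabeled, batch_size, strategy="nearest"):
    """Return indices of the unlabeled compounds to screen next.

    strategy (both are assumptions, not taken from the abstract):
      "nearest"          - compounds closest to the hyperplane (most uncertain)
      "largest_positive" - compounds farthest on the positive (active) side
    """
    dist = [signed_distance(w, b, x) for x in unlabeled]
    if strategy == "nearest":
        key = lambda i: abs(dist[i])
    elif strategy == "largest_positive":
        key = lambda i: -dist[i]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    order = sorted(range(len(unlabeled)), key=key)
    return order[:batch_size]


# Toy pool of four hypothetical compounds with two descriptors each.
w, b = [1.0, 0.0], -0.5
pool = [[0.4, 0.1], [2.0, 1.0], [-1.0, 0.3], [0.7, 0.9]]

print(select_batch(w, b, pool, 2, "nearest"))
print(select_batch(w, b, pool, 2, "largest_positive"))
```

In a full active learning loop, the selected compounds would be screened, added to the labeled set, and the hyperplane refitted before the next batch is chosen.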

[1]  Daphne Koller,et al.  Support Vector Machine Active Learning with Applications to Text Classification , 2002, J. Mach. Learn. Res..

[2]  Nello Cristianini,et al.  Query Learning with Large Margin Classifiers , 2000, ICML.

[3]  Aaas News,et al.  Book Reviews , 1893, Buffalo Medical and Surgical Journal.

[4]  Alexander J. Smola,et al.  Learning with kernels , 1998 .

[5]  Eli Shamir,et al.  Learning Using Query by Committee, Linear Separation and Random Walks , 2002 .

[6]  A. W.,et al.  Journal of chemical information and computer sciences. , 1995, Environmental science & technology.

[7]  Vladimir Naumovich Vapnik.  The Nature of Statistical Learning Theory , 1995 .

[8]  Bernhard Schölkopf,et al.  Feature selection and transduction for prediction of molecular bioactivity for drug design , 2003, Bioinform..

[9]  Daphne Koller,et al.  Support Vector Machine Active Learning with Applications to Text Classification , 2000, ICML.

[10]  David A. Cohn,et al.  Training Connectionist Networks with Queries and Selective Sampling , 1989, NIPS.

[11]  Thomas G. Dietterich.  What is machine learning? , 2020, Archives of Disease in Childhood.

[12]  Gunnar Rätsch,et al.  Active Learning in the Drug Discovery Process , 2001, NIPS.

[13]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[14]  David Saad,et al.  Learning from queries for maximum information gain in imperfectly learnable problems , 1994, NIPS.

[15]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[16]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[17]  David A. Cohn,et al.  Active Learning with Statistical Models , 1996, NIPS.