Active Sampling for Knowledge Discovery from Biomedical Data

We describe work aimed at cost-constrained knowledge discovery in the biomedical domain. To improve the diagnostic/prognostic models of cancer, new biomarkers are studied by researchers that might provide predictive information. Biological samples from monitored patients are selected and analyzed for determining the predictive power of the biomarker. During the process of biomarker evaluation, portions of the samples are consumed, limiting the number of measurements that can be performed. The biological samples obtained from carefully monitored patients, that are well annotated with pathological information, are a valuable resource that must be conserved. We present an active sampling algorithm derived from statistical first principles to incrementally choose the samples that are most informative in estimating the efficacy of the candidate biomarker. We provide empirical evidence on real biomedical data that our active sampling algorithm requires significantly fewer samples than random sampling to ascertain the efficacy of the new biomarker.

[1]  Mattia Barbareschi,et al.  TMABoost: an integrated system for comprehensive management of tissue microarray data , 2006, IEEE Transactions on Information Technology in Biomedicine.

[2]  Mark A Rubin,et al.  Prospective evaluation of AMACR (P504S) and basal cell markers in the assessment of routine prostate needle biopsy specimens. , 2004, Human pathology.

[3]  W. J. Studden,et al.  Theory Of Optimal Experiments , 1972 .

[4]  Daphne Koller,et al.  Support Vector Machine Active Learning with Application sto Text Classification , 2000, ICML.

[5]  Foster J. Provost,et al.  Active Sampling for Class Probability Estimation and Ranking , 2004, Machine Learning.

[6]  Daphne Koller,et al.  Support Vector Machine Active Learning with Applications to Text Classification , 2002, J. Mach. Learn. Res..

[7]  David A. Cohn,et al.  Active Learning with Statistical Models , 1996, NIPS.

[8]  Zhiqiang Zheng,et al.  On active learning for data acquisition , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[9]  K. Chaloner,et al.  Bayesian Experimental Design: A Review , 1995 .

[10]  Russell Greiner,et al.  Budgeted learning of nailve-bayes classifiers , 2002, UAI 2002.

[11]  J. Kononen,et al.  Tissue microarrays for high-throughput molecular profiling of tumor specimens , 1998, Nature Medicine.

[12]  H. Sebastian Seung,et al.  Query by committee , 1992, COLT '92.

[13]  G. Gray,et al.  Targeted therapy for cancer: the HER-2/neu and Herceptin story. , 2003, Clinical leadership & management review : the journal of CLMA.

[14]  H. Wynn,et al.  Maximum entropy sampling and optimal Bayesian experimental design , 2000 .

[15]  Russell Greiner,et al.  Budgeted Learning of Naive-Bayes Classifiers , 2003, UAI.

[16]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[17]  David J. C. MacKay,et al.  Information-Based Objective Functions for Active Data Selection , 1992, Neural Computation.