The Racing Algorithm: Model Selection for Lazy Learners

Given a set of models and some training data, we would like to find the model that best describes the data. Finding the model with the lowest generalization error is computationally expensive, especially when the number of test points or the number of models is large. Optimization techniques such as hill climbing or genetic algorithms are helpful, but they can converge to a model that is arbitrarily worse than the best one, or they cannot be applied at all when there is no distance metric on the space of discrete models. In this paper we develop a technique called “racing” that tests the set of models in parallel, quickly discards models that are clearly inferior, and concentrates the computational effort on differentiating among the better ones. Racing is especially suitable for selecting among lazy learners, since training requires negligible expense and incremental testing using leave-one-out cross-validation is efficient. We use racing to select among various lazy learning algorithms and to find relevant features in applications ranging from robot juggling to lesion detection in MRI scans.
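
The abstract does not fix a particular statistical test, but one common instantiation of racing (the Hoeffding race) bounds each model's true error with Hoeffding's inequality and eliminates any model whose confidence interval lies entirely above the best model's. The sketch below illustrates that loop in Python under stated assumptions: the `loo_error` method, the `max_loss` bound B, and the `delta` parameter are illustrative placeholders, not an interface from the paper. Because lazy learners need no training phase, every eliminated model directly saves leave-one-out evaluations on all remaining test points.

```python
import math
import random

def hoeffding_race(models, test_points, delta=0.05, max_loss=1.0):
    """Race a set of models, discarding clear losers early.

    A minimal sketch of racing with Hoeffding confidence bounds:
    - models: objects exposing loo_error(point), a hypothetical method
      returning that model's leave-one-out loss on one test point.
    - test_points: the pool of evaluation points.
    - delta: confidence parameter for each Hoeffding interval.
    - max_loss: B, an assumed upper bound on the per-point loss.
    """
    points = list(test_points)
    random.shuffle(points)                 # evaluate in random order
    loss_sum = {m: 0.0 for m in models}    # running loss per survivor

    n = 0
    for point in points:
        n += 1
        for m in loss_sum:
            loss_sum[m] += m.loo_error(point)

        # Hoeffding's inequality: with probability 1 - delta, a model's
        # true mean error lies within eps of its empirical mean after n
        # test points.
        eps = max_loss * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
        best_upper = min(s / n for s in loss_sum.values()) + eps

        # Discard any model whose lower bound exceeds the best model's
        # upper bound: it is clearly inferior at this confidence level.
        loss_sum = {m: s for m, s in loss_sum.items()
                    if s / n - eps <= best_upper}
        if len(loss_sum) == 1:
            break

    # Return the surviving model with the lowest empirical error.
    return min(loss_sum, key=loss_sum.get)
```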
