GPU-Accelerated Parameter Optimization for Classification Rule Learning

While some studies comparing rule-based classifiers enumerate a single parameter over several values, most use all default values, presumably due to the high computational cost of jointly tuning multiple parameters. We show that thorough, joint optimization of search parameters on individual datasets gives higher out-of-sample precision than fixed baselines. We test on 1,000 relatively large synthetic datasets with widely varying properties. We optimize heuristic beam search with the m-estimate interestingness measure, jointly tuning m, the beam size, and the maximum rule length. The beam size controls the extent of the search, where oversearching can find spurious rules. m controls the bias toward higher-frequency rules, with the optimal value depending on the amount of noise in the dataset. We assert that such hyper-parameters affecting the frequency bias and the extent of search should be optimized simultaneously, since both directly affect the false-discovery rate. While our method, based on grid search and cross-validation, is computationally intensive, we show that it can be massively parallelized, with our GPU implementation providing up to 28x speedup over a comparable multi-threaded CPU implementation.
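As a minimal sketch of the two ingredients named above, the snippet below computes the standard m-estimate of rule quality, (p + m·prior) / (p + n + m), and enumerates a joint parameter grid over m, beam size, and maximum rule length. The grid values are illustrative assumptions, not the values used in the experiments.

```python
from itertools import product

def m_estimate(p, n, m, prior):
    """m-estimate of rule quality: (p + m * prior) / (p + n + m).

    p: positive examples covered by the rule
    n: negative examples covered by the rule
    m: smoothing parameter (one of the tuned hyper-parameters)
    prior: prior probability of the positive class
    """
    return (p + m * prior) / (p + n + m)

# With m = 0 this reduces to raw precision p / (p + n); larger m
# pulls the estimate toward the class prior, penalizing
# low-coverage (potentially spurious) rules.
print(m_estimate(p=8, n=2, m=0, prior=0.5))  # 0.8
print(m_estimate(p=8, n=2, m=2, prior=0.5))  # 0.75

# Hypothetical joint grid over the three tuned parameters;
# each configuration would be scored by cross-validation.
m_values = [0.01, 0.1, 1.0, 10.0, 100.0]
beam_sizes = [1, 5, 10, 50]
max_rule_lengths = [2, 3, 5]

grid = list(product(m_values, beam_sizes, max_rule_lengths))
print(len(grid))  # 60 candidate configurations
```

Because each grid point is evaluated independently, the configurations can be scored in parallel, which is what makes the GPU parallelization described above effective.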
