Correlation-based Feature Selection for Machine Learning

A central problem in machine learning is identifying a representative set of features from which to construct a classification model for a particular task. This thesis addresses the problem of feature selection for machine learning through a correlation-based approach. The central hypothesis is that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other. A feature evaluation formula, based on ideas from test theory, provides an operational definition of this hypothesis. CFS (Correlation-based Feature Selection) is an algorithm that couples this evaluation formula with an appropriate correlation measure and a heuristic search strategy.

CFS was evaluated by experiments on artificial and natural datasets. Three machine learning algorithms were used: C4.5 (a decision tree learner), IB1 (an instance-based learner), and naive Bayes. Experiments on artificial datasets showed that CFS quickly identifies and screens irrelevant, redundant, and noisy features, and identifies relevant features as long as their relevance does not strongly depend on other features. On natural domains, CFS typically eliminated well over half the features. In most cases, classification accuracy using the reduced feature set equaled or bettered accuracy using the complete feature set. Feature selection degraded machine learning performance only in cases where the eliminated features were highly predictive of very small areas of the instance space.

Further experiments compared CFS with a wrapper, a well-known approach to feature selection that employs the target learning algorithm to evaluate feature sets. In many cases CFS gave results comparable to the wrapper and, in general, outperformed the wrapper on small datasets. CFS executes many times faster than the wrapper, which allows it to scale to larger datasets. Two methods of extending CFS to handle feature interaction are presented and experimentally evaluated. The first considers pairs of features and the second incorporates feature weights calculated by the RELIEF algorithm. Experiments on artificial domains showed that both methods were able to identify interacting features. On natural domains, the pairwise method gave more reliable results than using the weights provided by RELIEF.
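
The subset evaluation heuristic described above can be made concrete with a small sketch. The Python snippet below is a minimal illustration, not the thesis implementation: it assumes the commonly cited CFS merit formula from test theory, Merit_S = k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the number of features in subset S, r_cf is the mean feature-class correlation, and r_ff is the mean feature-feature intercorrelation, and it uses simple greedy forward selection to stand in for the heuristic search that CFS couples with the formula. The correlation values themselves are left as inputs, since the choice of correlation measure is a separate component of CFS.

```python
import math
from itertools import combinations

def merit(subset, class_corr, feat_corr):
    """Merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff),
    where r_cf is the mean feature-class correlation and r_ff is the
    mean pairwise feature-feature correlation within the subset."""
    k = len(subset)
    r_cf = sum(class_corr[f] for f in subset) / k
    if k > 1:
        pairs = list(combinations(sorted(subset), 2))
        r_ff = sum(feat_corr[p] for p in pairs) / len(pairs)
    else:
        r_ff = 0.0
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

def forward_select(features, class_corr, feat_corr):
    """Greedy forward selection: repeatedly add the feature that most
    improves subset merit; stop when no addition helps."""
    selected, best = [], 0.0
    while True:
        candidate, cand_merit = None, best
        for f in features:
            if f in selected:
                continue
            m = merit(selected + [f], class_corr, feat_corr)
            if m > cand_merit:
                candidate, cand_merit = f, m
        if candidate is None:
            return selected, best
        selected.append(candidate)
        best = cand_merit

# Toy example with made-up correlation magnitudes (all values hypothetical).
class_corr = {"a": 0.80, "b": 0.75, "c": 0.10}
feat_corr = {("a", "b"): 0.90, ("a", "c"): 0.05, ("b", "c"): 0.05}
subset, score = forward_select(["a", "b", "c"], class_corr, feat_corr)
print(subset, round(score, 3))  # ['a'] 0.8 -- "b" is redundant given "a",
                                # and "c" is only weakly relevant.
```

A more thorough search strategy, such as best-first search, could replace the greedy loop without changing the scoring; the merit function is what encodes the "relevant but not redundant" hypothesis.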
