Feature selection by iterative reweighting: an exploration of algorithms for linear models and random forests

In many areas of machine learning and data science, the available data are represented as vectors of feature values. Some of these features are useful for prediction, but others are spurious or redundant. Feature selection is commonly used to identify the useful features, and features are typically selected in an all-or-none fashion for inclusion in a model. We describe an alternative approach that has received little attention in the literature: determining the relative importance of features via continuous weights, and performing multiple iterations of model training that iteratively reweight the features until the least useful ones obtain a weight of zero. We explore feature selection by iterative reweighting for two classes of popular machine learning models: L1 penalized linear models and Random Forests. Recent studies have shown that incorporating importance weights into L1 penalized models improves predictive performance within a single iteration of training. In Chapter 3, we advance the state of the art by developing an alternative method for estimating feature importance based on subsampling. Extending the approach to multiple iterations of training, in which the importance weights from iteration n bias the training on iteration n + 1, seems promising, but past studies have found no benefit from iterative reweighting. In Chapter 4, using our improved estimates of feature importance, we obtain a significant 7.48% reduction in the error rate over standard L1 penalized algorithms, and nearly as large an improvement over alternative feature selection algorithms such as the Adaptive Lasso, Bootstrap Lasso, and MSA-LASSO. In Chapter 5, we consider iterative reweighting in the context of Random Forests and contrast it with a more standard backward-elimination technique that trains models with the full complement of features and iteratively removes the least important feature. In parallel with this contrast, we also compare several measures of importance, including our own proposal based on evaluating models constructed with and without each candidate feature. We show that our importance measure yields both higher accuracy and greater sparsity than importance measures obtained without retraining models (including those proposed by Breiman and Strobl), though at greater computational cost.
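As a rough illustration of the iterative reweighting idea described above (a sketch in the spirit of reweighted-L1 and adaptive-lasso methods, not the dissertation's actual procedure), the snippet below applies per-feature penalty weights to an L1 model by rescaling the design matrix, then updates the weights from the fitted coefficients so that features with small coefficients receive an increasingly heavy penalty and are eventually driven to zero. It assumes scikit-learn's Lasso; the function name and all hyperparameter values are placeholders.

```python
# Minimal sketch of iteratively reweighted L1 regression.
# The column rescaling uses the standard equivalence between a weighted L1
# penalty and an ordinary lasso fit on a column-scaled design matrix.
import numpy as np
from sklearn.linear_model import Lasso

def iteratively_reweighted_lasso(X, y, alpha=0.1, n_iter=5, eps=1e-3):
    n_features = X.shape[1]
    weights = np.ones(n_features)             # start with a uniform penalty
    coef = np.zeros(n_features)
    for _ in range(n_iter):
        X_scaled = X / weights                # smaller weight => weaker penalty on that column
        model = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y)
        coef = model.coef_ / weights          # undo the column rescaling
        weights = 1.0 / (np.abs(coef) + eps)  # weak features get a heavier penalty next round
    return coef                               # exact zeros mark de-selected features
```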

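For contrast, here is a similarly hedged sketch of the backward-elimination strategy mentioned in the abstract, with a feature's importance estimated by retraining a Random Forest with and without that feature. It assumes scikit-learn's RandomForestClassifier and cross_val_score; the stopping criterion, fold count, and forest size are illustrative choices rather than those used in the dissertation.

```python
# Minimal sketch of backward elimination with retraining-based importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def backward_eliminate(X, y, n_keep=5, cv=5, n_estimators=200, random_state=0):
    """Greedy backward elimination; returns the indices of the surviving features."""
    remaining = list(range(X.shape[1]))
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    while len(remaining) > n_keep:
        base = cross_val_score(rf, X[:, remaining], y, cv=cv).mean()
        # Importance of feature j = accuracy lost when j is dropped and the forest retrained.
        drops = []
        for j in remaining:
            cols = [k for k in remaining if k != j]
            drops.append(base - cross_val_score(rf, X[:, cols], y, cv=cv).mean())
        # Discard the feature whose removal hurts cross-validated accuracy the least.
        remaining.pop(int(np.argmin(drops)))
    return remaining
```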