Context-dependent feature analysis with random forests

Feature selection is often more complicated than identifying a single subset of input variables that together explain the output. There may be interactions that depend on contextual information, i.e., variables that turn out to be relevant only in specific circumstances. In this setting, the contribution of this paper is to extend the random forest variable importances framework in order (i) to identify variables whose relevance is context-dependent and (ii) to characterize as precisely as possible the effect of contextual information on these variables. The usefulness and relevance of our framework for highlighting context-dependent variables are illustrated on both artificial and real datasets.
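To make the idea concrete, below is a minimal sketch (in Python with scikit-learn) of the kind of analysis the paper formalizes: it contrasts standard Mean Decrease Impurity importances computed on the full dataset with importances computed within each value of a contextual variable. The data, variable names, and the per-context refitting strategy are invented for illustration only; the paper's actual framework derives context-dependent importances directly from the forest rather than by refitting per context.

```python
# Illustrative sketch (not the paper's estimator): compare global random-forest
# importances with importances computed within each context value. A feature
# whose importance changes strongly across contexts is a candidate
# context-dependent variable. All names and data below are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 2000
context = rng.randint(0, 2, n)     # hypothetical contextual variable
x1 = rng.randn(n)                  # relevant only when context == 1
x2 = rng.randn(n)                  # always relevant
noise = rng.randn(n)               # irrelevant
y = ((context * x1 + x2 + 0.1 * noise) > 0).astype(int)

X = pd.DataFrame({"x1": x1, "x2": x2, "noise": noise})

def mdi_importances(X, y):
    """Mean Decrease Impurity importances from a fitted random forest."""
    forest = RandomForestClassifier(n_estimators=500, random_state=0)
    forest.fit(X, y)
    return pd.Series(forest.feature_importances_, index=X.columns)

global_imp = mdi_importances(X, y)
per_context = {c: mdi_importances(X[context == c], y[context == c])
               for c in (0, 1)}

print(pd.DataFrame({"global": global_imp,
                    **{f"context={c}": v for c, v in per_context.items()}}))
```

In this toy setting, x1 should show a much larger importance in the context == 1 subset than in the context == 0 subset, flagging it as context-dependent, while x2 remains important in both.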
