Feature Selection at the Discrete Limit

Feature selection plays an important role in many machine learning and data mining applications. In this paper, we propose to use the L2,p norm for feature selection, with emphasis on small p. As p → 0, the problem approaches discrete feature selection. We provide two algorithms: a proximal gradient algorithm and a rank-one update algorithm, the latter being more efficient at large regularization λ. We derive closed-form solutions of the proximal operator at p = 0 and p = 1/2. Experiments on real-life datasets show that features selected at small p consistently outperform those selected at p = 1 (the standard L2,1 approach) and by other popular feature selection methods.
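
The p = 0 case admits a particularly simple closed form: the proximal operator of λ‖·‖2,0 decouples over the rows of the weight matrix and reduces to row-wise hard thresholding, so a row (i.e., a feature) survives exactly when its Euclidean norm exceeds √(2λ). The sketch below illustrates this inside a plain proximal gradient loop on a least-squares loss; it is a minimal illustration under assumed names (X, Y, W, lam, eta) and an assumed smooth loss, not the paper's exact formulation or its rank-one update algorithm.

```python
import numpy as np

def prox_l20(W, lam):
    """Proximal operator of lam * ||W||_{2,0} (row-wise hard thresholding).

    Solves min_X 0.5 * ||X - W||_F^2 + lam * ||X||_{2,0}, which decouples
    over rows: row w_i is kept iff 0.5 * ||w_i||_2^2 > lam, i.e.
    ||w_i||_2 > sqrt(2 * lam); otherwise it is set to zero.
    """
    X = W.copy()
    row_norms = np.linalg.norm(X, axis=1)
    X[row_norms <= np.sqrt(2.0 * lam)] = 0.0
    return X

def proximal_gradient_l20(X, Y, lam, eta=None, n_iter=200):
    """Proximal gradient descent for
    min_W 0.5 * ||X W - Y||_F^2 + lam * ||W||_{2,0}.

    Rows of W driven to zero correspond to discarded features.
    """
    d, k = X.shape[1], Y.shape[1]
    if eta is None:
        # 1 / Lipschitz constant of the gradient of the smooth term,
        # i.e. 1 / sigma_max(X)^2.
        eta = 1.0 / np.linalg.norm(X, 2) ** 2
    W = np.zeros((d, k))
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Y)  # gradient of the least-squares term
        W = prox_l20(W - eta * grad, eta * lam)
    return W

# Selected features are the rows of W with nonzero norm:
# selected = np.flatnonzero(np.linalg.norm(W, axis=1) > 0)
```

The paper also gives a closed form at p = 1/2; that operator (a half-thresholding applied to the row norms) is more involved and is omitted here.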
