Efficiently handling feature redundancy in high-dimensional data

High-dimensional data poses a severe challenge for data mining. Feature selection is a widely used pre-processing technique that makes subsequent mining of high-dimensional data tractable. Traditionally, feature selection has focused on removing irrelevant features; for high-dimensional data, however, removing redundant features is equally critical. In this paper, we study feature redundancy in high-dimensional data and propose a novel correlation-based approach to feature selection within the filter model. An extensive empirical study on real-world data shows that the proposed approach is efficient and effective in removing both redundant and irrelevant features.
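The abstract does not spell out the algorithm, but the general idea of a correlation-based filter can be sketched as follows: rank features by their correlation with the class, then greedily keep a feature only if no already-kept, more relevant feature is more strongly correlated with it than it is with the class. This is a rough illustration only, assuming Pearson correlation as the correlation measure and a hypothetical `relevance_threshold` parameter for irrelevance removal; the paper's actual method may differ.

```python
import numpy as np

def correlation_filter(X, y, relevance_threshold=0.1):
    """Greedy correlation-based filter (illustrative sketch, not the paper's exact method).

    A feature is kept if (a) its correlation with the target exceeds the
    threshold (irrelevance removal) and (b) no already-kept, more relevant
    feature correlates with it more strongly than it correlates with the
    target (redundancy removal).
    """
    n_features = X.shape[1]
    # Relevance: absolute Pearson correlation between each feature and the target.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    # Visit features in order of decreasing relevance, skipping irrelevant ones.
    order = [j for j in np.argsort(-relevance) if relevance[j] >= relevance_threshold]
    kept = []
    for j in order:
        # Redundant if some more relevant kept feature "covers" this one.
        redundant = any(
            abs(np.corrcoef(X[:, k], X[:, j])[0, 1]) >= relevance[j] for k in kept
        )
        if not redundant:
            kept.append(j)
    return sorted(kept)
```

On a toy dataset with one predictive feature, an exact duplicate of it, and a pure-noise feature, the sketch keeps a single feature: the duplicate is dropped as redundant and the noise feature as irrelevant.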
