Paper information - A pitfall and solution in multi-class feature selection for text classification

A pitfall and solution in multi-class feature selection for text classification

Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we find that a large class of feature scoring methods suffers a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring features needed to discriminate difficult classes. In this paper we demonstrate this pitfall hurts performance even for a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid this pitfall, without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.
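The abstract says only that the solutions are "inspired by round-robin scheduling", so the following is a minimal sketch of that idea, not the paper's actual algorithm. It assumes per-class feature scores (for example, one-vs-rest Information Gain) have already been computed. The function names `global_top_k` and `round_robin_select` and the toy scores are hypothetical, chosen for illustration.

```python
def global_top_k(scores_by_class, k):
    """Baseline: rank features by their best score over all classes.

    This is the kind of pooled ranking that can be dominated by classes
    with many strongly predictive features.
    """
    best = {}
    for scores in scores_by_class.values():
        for feature, score in scores.items():
            best[feature] = max(best.get(feature, score), score)
    return sorted(best, key=best.get, reverse=True)[:k]


def round_robin_select(scores_by_class, k):
    """Sketch: classes take turns nominating their best unselected feature.

    Assumption: scores_by_class maps class -> {feature: score}, where the
    score is any per-class (one-vs-rest) metric computed beforehand.
    """
    # Rank each class's features from highest to lowest score.
    rankings = {
        c: sorted(scores, key=scores.get, reverse=True)
        for c, scores in scores_by_class.items()
    }
    positions = {c: 0 for c in rankings}
    distinct = {f for ranking in rankings.values() for f in ranking}
    k = min(k, len(distinct))  # cannot select more features than exist

    selected, chosen = [], set()
    while len(selected) < k:
        for c, ranking in rankings.items():
            if len(selected) >= k:
                break
            # Skip features another class has already nominated.
            i = positions[c]
            while i < len(ranking) and ranking[i] in chosen:
                i += 1
            if i < len(ranking):
                selected.append(ranking[i])
                chosen.add(ranking[i])
                i += 1
            positions[c] = i
    return selected


# Toy scores: "easy" has a surplus of strong features, "hard" has only weak ones.
scores = {
    "easy": {"goal": 0.9, "match": 0.8, "team": 0.7, "the": 0.01},
    "hard": {"ledger": 0.2, "audit": 0.1, "the": 0.01},
}
print(global_top_k(scores, 3))        # every pick comes from the "easy" class
print(round_robin_select(scores, 4))  # picks alternate between the two classes
```

On the toy data, the pooled baseline spends its whole budget on the "easy" class, while the round-robin sketch guarantees the "hard" class a share of the selected features. This needs only one scoring pass per class, so it avoids the repeated classifier training that wrapper methods require.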

George Forman

[1] Jason Weston, et al. Gene Selection for Cancer Classification using Support Vector Machines, 2002, Machine Learning.

[2] George Forman, et al. An Extensive Empirical Study of Feature Selection Metrics for Text Classification, 2003, J. Mach. Learn. Res.

[3] Eui-Hong, et al. Centroid-Based Document Classification: Analysis & Experimental Results, 2000.

[4] Dunja Mladenic, et al. Feature Selection for Unbalanced Class Distribution and Naive Bayes, 1999, ICML.

[5] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.

[6] Jason D. M. Rennie. Improving multi-class text classification with Naive Bayes, 2001.

[7] Yiming Yang, et al. A Comparative Study on Feature Selection in Text Categorization, 1997, ICML.

[8] Geoff Holmes, et al. Benchmarking Attribute Selection Techniques for Discrete Class Data Mining, 2003, IEEE Trans. Knowl. Data Eng.

[9] George Karypis, et al. Centroid-Based Document Classification Algorithms: Analysis & Experimental Results, 2000.

[10] Yiming Yang, et al. A re-examination of text categorization methods, 1999, SIGIR '99.

[11] Dunja Mladenic, et al. Word sequences as features in text-learning, 1998.

[12] George Karypis, et al. Centroid-Based Document Classification: Analysis and Experimental Results, 2000, PKDD.

[13] Ian H. Witten, et al. Data mining: practical machine learning tools and techniques with Java implementations, 2002, SGMD.

[14] Ron Kohavi, et al. Wrappers for Feature Subset Selection, 1997, Artif. Intell.