Optimal number of features as a function of sample size for various classification rules

MOTIVATION: Given the joint feature-label distribution, increasing the number of features always decreases classification error; this is not the case, however, when a classifier is designed via a classification rule from sample data. Typically (but not always), for a fixed sample size, the error of a designed classifier first decreases and then increases as the number of features grows. The potential downside of using too many features is most critical for small samples, which are commonplace for gene-expression-based classifiers for phenotype discrimination. For a fixed sample size and feature-label distribution, the issue is to find the optimal number of features.

RESULTS: Since the distribution of the error as a function of the number of features and sample size is known only in rare cases, this study employs simulation for various feature-label distributions and classification rules, and across a wide range of sample and feature-set sizes. To achieve the desired end, finding the optimal number of features as a function of sample size, it employs massively parallel computation. Seven classification rules are treated: 3-nearest-neighbor, Gaussian kernel, linear support vector machine, polynomial support vector machine, perceptron, regular histogram and linear discriminant analysis. Three Gaussian-based models are considered: linear, nonlinear and bimodal. In addition, real patient data from a large breast-cancer study are considered. To mitigate the combinatorial search for optimal feature sets, and to model the situation in which subsets of genes are co-regulated and correlation is internal to these subsets, we assume that the covariance matrix of the features is block-structured, with each block corresponding to a group of correlated features. Altogether, a large number of error surfaces result from the many cases. These are provided in full on a companion website, which is meant to serve as a resource for those working with small-sample classification.
AVAILABILITY: The companion website is at http://public.tgen.org/tamu/ofs/

CONTACT: e-dougherty@ee.tamu.edu
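The peaking behavior described above can be illustrated with a small Monte Carlo sketch: fix a small training sample, sweep the number of features, and estimate the true error of a plug-in linear discriminant on held-out data drawn from a Gaussian model with a block-structured covariance matrix. This is only a minimal illustration, not the paper's massively parallel simulation; the block size, within-block correlation, sample size and mean structure are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_covariance(d, block_size=5, rho=0.25):
    """Block-diagonal covariance: correlation rho inside each block, 0 across blocks."""
    cov = np.zeros((d, d))
    for s in range(0, d, block_size):
        e = min(s + block_size, d)
        cov[s:e, s:e] = rho
    np.fill_diagonal(cov, 1.0)
    return cov

def lda_error(d, n_per_class=10, n_test=1000):
    """Design plug-in LDA from a small sample; estimate its error on a large test set."""
    mu = 0.8 / np.sqrt(np.arange(1, d + 1))   # later features carry less discriminatory power
    cov = block_covariance(d)
    X0 = rng.multivariate_normal(-mu, cov, n_per_class)
    X1 = rng.multivariate_normal(mu, cov, n_per_class)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    pooled = np.atleast_2d(np.cov(np.vstack([X0 - m0, X1 - m1]), rowvar=False))
    pooled += 1e-3 * np.eye(d)                # small ridge: the estimate is singular once 2n < d
    w = np.linalg.solve(pooled, m1 - m0)      # plug-in LDA discriminant direction
    b = -w @ (m0 + m1) / 2
    T0 = rng.multivariate_normal(-mu, cov, n_test)
    T1 = rng.multivariate_normal(mu, cov, n_test)
    mistakes = np.concatenate([T0 @ w + b >= 0, T1 @ w + b < 0])
    return mistakes.mean()

dims = [1, 2, 5, 10, 20, 30]
errors = [np.mean([lda_error(d) for _ in range(30)]) for d in dims]
for d, e in zip(dims, errors):
    print(f"d={d:2d}  error={e:.3f}")   # error typically falls, then rises, as d grows
```

With only 20 training points, the pooled covariance estimate degrades quickly as the dimension grows, so the designed classifier's error eventually rises even though the Bayes error keeps decreasing, which is exactly why an optimal feature count exists at fixed sample size.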
