A Framework for Feature Selection in Clustering

We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated and genomic data.
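The alternating scheme behind sparse K-means can be sketched as follows. The abstract does not spell out the criterion, so this sketch assumes the commonly described form: maximize the weighted between-cluster sum of squares, sum_j w_j * BCSS_j, subject to ||w||_2 <= 1, ||w||_1 <= s and w_j >= 0. All function names, the tuning parameter `s` (assumed to be at least 1), and the plain Lloyd-style K-means are illustrative choices, not the authors' implementation.

```python
import numpy as np


def kmeans(X, k, rng, n_init=5, n_iter=100):
    """Plain Lloyd's algorithm with random restarts; returns the best labels."""
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            # Keep the old center if a cluster becomes empty.
            new_centers = np.array([
                X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
                for c in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels, cost = dist.argmin(axis=1), dist.min(axis=1).sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def per_feature_bcss(X, labels):
    """Between-cluster sum of squares of each feature (total minus within)."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return total - within


def _unit(v):
    """Scale v to unit L2 norm; leave an all-zero vector unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def update_weights(bcss, s, n_bisect=60):
    """Soft-threshold BCSS and rescale so that ||w||_2 = 1 and ||w||_1 <= s."""
    a = np.maximum(bcss, 0.0)
    w = _unit(a)
    if w.sum() <= s:  # L1 constraint already satisfied, no thresholding needed
        return w
    lo, hi = 0.0, a.max()
    for _ in range(n_bisect):  # bisection on the soft-threshold level
        mid = 0.5 * (lo + hi)
        if _unit(np.maximum(a - mid, 0.0)).sum() > s:
            lo = mid
        else:
            hi = mid
    return _unit(np.maximum(a - hi, 0.0))


def sparse_kmeans(X, k, s, n_outer=10, seed=0):
    """Alternate between clustering with fixed weights and updating weights."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    w = np.full(p, 1.0 / np.sqrt(p))  # start with equal feature weights
    for _ in range(n_outer):
        # Weighted squared distances equal plain distances on sqrt(w)-scaled data.
        labels = kmeans(X * np.sqrt(w), k, rng)
        w_new = update_weights(per_feature_bcss(X, labels), s)
        converged = np.abs(w_new - w).sum() < 1e-4 * np.abs(w).sum()
        w = w_new
        if converged:
            break
    return labels, w
```

Calling `sparse_kmeans(X, k=2, s=1.2)` on an `n x p` array returns cluster labels and a weight vector in which features that do not separate the clusters tend to receive weight zero. Smaller `s` gives sparser weights; how to choose it is not covered in this sketch.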

Robert Tibshirani | Daniela M. Witten
