A Framework for Feature Selection in Clustering

We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated and genomic data.
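As a rough illustration of how a lasso-type penalty can select features during clustering, the following is a minimal sketch of one way sparse K-means could be implemented. The abstract does not state the criterion, so the formulation here is an assumption: nonnegative feature weights w maximize the weighted per-feature between-cluster sum of squares, subject to an L2 bound of 1 and an L1 bound s (assumed to satisfy 1 < s <= sqrt(p)). All function names and parameters (`sparse_kmeans`, `update_weights`, `s`, `n_outer`, and so on) are hypothetical, and the plain Lloyd-style `kmeans` helper is a stand-in for any K-means routine.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50, n_init=5):
    """Plain Lloyd-style K-means with random restarts (illustrative helper)."""
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        cost = d.min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_labels = cost, d.argmin(axis=1)
    return best_labels

def per_feature_bcss(X, labels, k):
    """Between-cluster sum of squares, computed separately for each feature."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for c in range(k):
        Xc = X[labels == c]
        if len(Xc):
            within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return total - within

def update_weights(bcss, s, n_bisect=40):
    """Soft-threshold and rescale so that ||w||_2 = 1 and ||w||_1 <= s.

    Assumes 1 < s <= sqrt(p); the threshold is found by bisection.
    """
    a = np.maximum(bcss, 0.0)
    if a.max() == 0.0:                      # degenerate case: no signal at all
        return np.full(a.shape, 1.0 / np.sqrt(a.size))
    w = a / np.linalg.norm(a)
    if w.sum() <= s:                        # L1 bound inactive, no thresholding
        return w
    lo, hi, best = 0.0, a.max(), None
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        w = np.maximum(a - mid, 0.0)        # soft-thresholding of nonnegative a
        w = w / np.linalg.norm(w)
        if w.sum() > s:
            lo = mid
        else:
            hi, best = mid, w
    return best if best is not None else w

def sparse_kmeans(X, k, s, n_outer=10, seed=0):
    """Alternate between clustering on weighted features and reweighting."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    w = np.full(p, 1.0 / np.sqrt(p))        # start from equal weights
    for _ in range(n_outer):
        labels = kmeans(X * np.sqrt(w), k, rng)
        w_new = update_weights(per_feature_bcss(X, labels, k), s)
        converged = np.abs(w_new - w).sum() < 1e-4 * np.abs(w).sum()
        w = w_new
        if converged:
            break
    return labels, w
```

In this sketch a single objective drives both steps: the cluster assignment is chosen with the weights held fixed, and the weights are then updated with the assignment held fixed. A smaller `s` pushes more weights to exactly zero, so the nonzero entries of `w` play the role of the selected features.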
