A Strongly Consistent Sparse $k$-means Clustering with Direct $\ell_1$ Penalization on Variable Weights

We propose the Lasso Weighted $k$-means ($LW$-$k$-means) algorithm as a simple yet efficient sparse clustering procedure for high-dimensional data where the number of features ($p$) can be much larger than the number of observations ($n$). In the $LW$-$k$-means algorithm, we introduce a lasso-based penalty term directly on the feature weights to incorporate feature selection into the framework of sparse clustering. $LW$-$k$-means makes no distributional assumptions about the given dataset and thus yields a non-parametric method for feature selection. We also analytically investigate the convergence of the underlying optimization procedure in $LW$-$k$-means and establish the strong consistency of our algorithm. $LW$-$k$-means is tested on several real-life and synthetic datasets, and detailed experimental analysis shows that its performance is highly competitive with several state-of-the-art procedures for clustering and feature selection, not only in terms of clustering accuracy but also in computational time.
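To make the central idea concrete, the sketch below shows one plausible alternating scheme in which an $\ell_1$ (lasso) penalty on the feature weights drives the weights of uninformative features exactly to zero. The weight update here soft-thresholds the per-feature between-cluster dispersion, in the spirit of the sparse k-means framework of Witten and Tibshirani; it is a hypothetical simplification for illustration, not the exact $LW$-$k$-means update rules derived in the paper (the function name `lasso_weighted_kmeans`, the penalty parameter `lam`, and the weight normalization are our assumptions).

```python
import numpy as np

def soft_threshold(b, lam):
    # Soft-thresholding operator induced by an l1 penalty.
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def lasso_weighted_kmeans(X, k, lam, n_iter=20, seed=0):
    """Illustrative sparse k-means with a lasso penalty on feature weights.

    A simplified sketch, NOT the exact LW-k-means updates from the paper.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.full(p, 1.0 / np.sqrt(p))                 # initial feature weights
    centers = X[rng.choice(n, size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # 1) Assign each point to the nearest center under weighted distances.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2 * w).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2) Update each center as the mean of its cluster (standard k-means step).
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
        # 3) Update weights by soft-thresholding the per-feature between-cluster
        #    dispersion, so features that do not separate the clusters receive
        #    exactly zero weight.
        total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
        within = np.zeros(p)
        for c in range(k):
            if np.any(labels == c):
                within += ((X[labels == c] - centers[c]) ** 2).sum(axis=0)
        between = total - within
        w = soft_threshold(between, lam)
        norm = np.linalg.norm(w)
        if norm > 0:
            w /= norm                                # keep weights on the unit sphere
    return labels, w
```

The three block-coordinate steps (assignments, centers, weights) mirror the alternating structure whose convergence the paper analyzes; larger values of `lam` zero out more feature weights, trading clustering fit for sparsity.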
