Semi-supervised projected model-based clustering

We present an adaptation of model-based clustering for partially labeled data that is capable of finding hidden cluster labels. All clusters, both those originally known and those discovered, are represented using localized feature subset selections (subspaces), which yields clusters that global feature subset selection cannot discover. The semi-supervised projected model-based clustering algorithm (SeSProC) also includes a novel model selection approach that uses a greedy forward search to estimate the final number of clusters. The quality of SeSProC is assessed on synthetic data, demonstrating its effectiveness under different data conditions, not only at classifying instances with known labels but also at discovering completely hidden clusters in different subspaces. SeSProC also outperforms three related baseline algorithms in most scenarios on synthetic and real data sets.
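
The greedy forward search over the number of clusters can be illustrated with a minimal sketch. This is not the SeSProC implementation: it assumes one-dimensional data, a plain Gaussian mixture fitted by EM, BIC as a stand-in scoring criterion, and a new component seeded at the worst-explained unlabeled point. It also omits the per-cluster feature subspaces entirely. All function names (`fit`, `greedy_forward_search`, and the helpers) are hypothetical.

```python
import math

def log_normal(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_joint(x, weights, means, variances):
    """log(weight_j * N(x | mean_j, var_j)) for every component j."""
    return [math.log(w) + log_normal(x, m, v)
            for w, m, v in zip(weights, means, variances)]

def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def point_loglik(x, y, params):
    """Labeled points use their own component; unlabeled ones sum over all."""
    lj = log_joint(x, *params)
    return lj[y] if y is not None else logsumexp(lj)

def fit(xs, labels, init_means, iters=200, var_floor=1e-2):
    """EM with labeled points clamped to their component; returns (params, BIC)."""
    n, k = len(xs), len(init_means)
    centre = sum(xs) / n
    spread = sum((x - centre) ** 2 for x in xs) / n
    weights, means, variances = [1.0 / k] * k, list(init_means), [spread] * k
    for _ in range(iters):
        # E-step: responsibilities (one-hot for labeled points).
        resp = []
        for x, y in zip(xs, labels):
            if y is not None:
                resp.append([float(j == y) for j in range(k)])
            else:
                lj = log_joint(x, weights, means, variances)
                norm = logsumexp(lj)
                resp.append([math.exp(v - norm) for v in lj])
        # M-step: weighted updates, with a variance floor (an assumption)
        # so that no component collapses onto a single point.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-9)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_floor, var)
    params = (weights, means, variances)
    loglik = sum(point_loglik(x, y, params) for x, y in zip(xs, labels))
    bic = -2 * loglik + (3 * k - 1) * math.log(n)  # 3k - 1 free parameters
    return params, bic

def greedy_forward_search(xs, labels, max_k=10):
    """Start from the known classes (assumed labeled 0..C-1) and add one
    component at a time while the BIC keeps improving."""
    known = sorted({y for y in labels if y is not None})
    means = []
    for c in known:
        members = [x for x, y in zip(xs, labels) if y == c]
        means.append(sum(members) / len(members))
    params, bic = fit(xs, labels, means)
    unlabeled = [x for x, y in zip(xs, labels) if y is None]
    while unlabeled and len(params[1]) < max_k:
        # Seed the candidate component at the worst-explained unlabeled point.
        worst = min(unlabeled, key=lambda x: point_loglik(x, None, params))
        cand_params, cand_bic = fit(xs, labels, params[1] + [worst])
        if cand_bic >= bic:
            break  # no improvement: keep the current number of clusters
        params, bic = cand_params, cand_bic
    return params, bic

# Toy data: two partially labeled classes near 0 and 10, plus a fully
# unlabeled (hidden) group near 25.
offsets = [-0.8, -0.4, 0.0, 0.4, 0.8]
xs, labels = [], []
for centre, label in [(0, 0), (0, None), (10, 1), (10, None), (25, None)]:
    xs += [centre + o for o in offsets]
    labels += [label] * len(offsets)

(weights, means, variances), bic = greedy_forward_search(xs, labels)
print(len(means), [round(m, 1) for m in means])
```

On this toy data the search starts from the two labeled classes and should add a single extra component for the hidden group. On real data the outcome depends on the scoring criterion and on how each new component is seeded, both of which are assumptions here.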
