A Survey of Semi-Supervised Clustering Algorithms: from a priori scheme to interactive scheme and open issues

Over the last 10 years, semi-supervised clustering (SSC), also called clustering with side information, has received significant attention from researchers because of its success in applications such as document and image clustering. SSC has been shown to improve clustering performance substantially using just a few constraints or labelled data points as side information, provided by an expert or an oracle system. Most work done so far can be classified into one of two SSC schemes: the a priori scheme and the interactive scheme. This survey covers both schemes together with the important algorithms in each, and concludes with a summary of open issues.
