Discrete model-based clustering with overlapping subsets of attributes

Traditional model-based clustering methods assume that data instances can be grouped in a single “best” way. This assumption often fails for complex data, where several meaningful sets of clusters may exist, each associated with its own subset of data attributes. The current literature has approached this problem with models that use disjoint subsets of attributes to define distinct clustering solutions, each solution being represented by a cluster variable. However, restricting each attribute to a single cluster variable diminishes the expressiveness and quality of these models. For this reason, we propose a novel class of models that allows cluster variables to share overlapping subsets of attributes. To learn these models, we propose combining a search-based method with an attribute clustering procedure. Experimental results with both synthetic and real-world data show the utility of our approach and its competitiveness with the state of the art.
