X-means: Extending K-means with Efficient Estimation of the Number of Clusters

Despite its popularity for general clustering, K-means suffers three major shortcomings: it scales poorly computationally, the number of clusters K has to be supplied by the user, and the search is prone to local minima. We propose solutions for the first two problems, and a partial remedy for the third. Building on prior work for algorithmic acceleration that is not based on approximation, we introduce a new algorithm that efficiently searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. The innovations include two new ways of exploiting cached sufficient statistics and a new, very efficient test that in one K-means sweep selects the most promising subset of classes for refinement. This gives rise to a fast, statistically founded algorithm that outputs both the number of classes and their parameters. Experiments show this technique reveals the true number of classes in the underlying distribution, and that it is much faster than repeatedly using accelerated K-means for different values of K.
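
To make the model-selection idea concrete, the following is a minimal Python sketch of scoring a fixed hard clustering with BIC, so that clusterings with different K can be compared. It is an illustration under stated assumptions, not the algorithm proposed here: it assumes identical spherical Gaussian clusters, a per-dimension variance estimate of SSE / (d (R - K)), and a free-parameter count of (K - 1) + K d + 1. The function name `bic_score` and the toy data are hypothetical, and the sketch uses none of the cached-statistics accelerations.

```python
import math


def bic_score(points, centers, labels):
    """BIC of a hard clustering; larger is better.

    Assumes identical spherical Gaussians (an assumption of this sketch).
    points:  list of d-dimensional tuples
    centers: list of K d-dimensional tuples
    labels:  labels[i] is the index of the center that owns points[i]
    """
    R = len(points)      # number of points
    d = len(points[0])   # dimensionality
    K = len(centers)     # number of clusters

    # Sum of squared distances from each point to its own center.
    sse = sum(
        sum((x - c) ** 2 for x, c in zip(p, centers[k]))
        for p, k in zip(points, labels)
    )
    if R <= K or sse <= 0.0:
        raise ValueError("need R > K and a non-zero distortion")

    # Shared per-dimension variance estimate (assumed form).
    variance = sse / (d * (R - K))

    # Cluster sizes, used for the mixing-proportion term.
    counts = [0] * K
    for k in labels:
        counts[k] += 1

    # Log-likelihood of the data under the fitted model.
    log_lik = (
        sum(n * math.log(n / R) for n in counts if n > 0)
        - 0.5 * R * d * math.log(2.0 * math.pi * variance)
        - sse / (2.0 * variance)
    )

    # Free parameters: K-1 mixing weights, K*d center coordinates, 1 variance.
    n_params = (K - 1) + K * d + 1
    return log_lik - 0.5 * n_params * math.log(R)


# Toy data: two well-separated square blobs in the plane.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

# K = 1: a single center at the overall mean.
bic_one = bic_score(points, [(5.5, 5.5)], [0] * 8)

# K = 2: one center per blob.
bic_two = bic_score(points, [(0.5, 0.5), (10.5, 10.5)],
                    [0, 0, 0, 0, 1, 1, 1, 1])

print("BIC for K=1:", bic_one)
print("BIC for K=2:", bic_two)
```

On this toy data the two-cluster model scores higher than the one-cluster model, which is the kind of comparison a BIC-driven search over K relies on. Swapping the penalty term `0.5 * n_params * math.log(R)` for `n_params` would give an AIC-style score under the same assumptions.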