MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

The classical mixture of Gaussians model is related to K-means via small-variance asymptotics: as the covariances of the Gaussians tend to zero, the negative log-likelihood of the mixture of Gaussians model approaches the K-means objective, and the EM algorithm approaches the K-means algorithm. Kulis & Jordan (2012) used this observation to obtain a novel K-means-like algorithm from a Gibbs sampler for the Dirichlet process (DP) mixture. We instead consider applying small-variance asymptotics directly to the posterior in Bayesian nonparametric models. This framework is independent of any specific Bayesian inference algorithm, and it has the major advantage that it generalizes immediately to a range of models beyond the DP mixture. To illustrate, we apply our framework to the feature learning setting, where the beta process and Indian buffet process provide an appropriate Bayesian nonparametric prior. This yields a novel objective function that goes beyond clustering: it learns groupings for which the mutual exclusivity and exhaustivity assumptions of clustering are relaxed, while penalizing the introduction of new groups. We derive several other algorithms, all of which are scalable and simple to implement, and empirical results demonstrate the benefits of the new framework.
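As a rough sketch of the small-variance limits described above (our notation, assuming each Gaussian has spherical covariance with variance sigma^2 in every dimension and hard assignments z_n; the exact constants and penalty parameterizations vary across the papers cited below):

\begin{align*}
  -2\sigma^{2} \log p(x_{1:N} \mid z, \mu)
    &\;\longrightarrow\; \sum_{n=1}^{N} \lVert x_n - \mu_{z_n} \rVert^{2}
    \qquad \text{as } \sigma^{2} \to 0 \text{ (the K-means objective)}, \\
  \text{DP mixture:}\quad
    &\min_{K,\, z,\, \mu}\; \sum_{n=1}^{N} \lVert x_n - \mu_{z_n} \rVert^{2} + \lambda K, \\
  \text{beta process / IBP features:}\quad
    &\min_{K,\, Z,\, A}\; \lVert X - Z A \rVert_{F}^{2} + \lambda^{2} K,
\end{align*}

where Z is an N-by-K binary feature-assignment matrix, the rows of A are feature means, and lambda is a penalty that emerges in the limit from the prior's mass (concentration) parameter. Roughly speaking, the combinatorial prior survives only as the additive penalty on K, which is what makes these objectives amenable to simple K-means-style coordinate descent.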

[1] D. Blackwell, et al. Ferguson Distributions via Polya Urn Schemes, 1973.

[2] C. Antoniak. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems, 1974.

[3] H. Akaike. A new look at the statistical model identification, 1974.

[4] A. D. Gordon, et al. An Algorithm for Euclidean Sum of Squares Classification, 1977.

[5] J. Kingman. The Representation of Partition Structures, 1978.

[6] D. Aldous. Exchangeability and related topics, 1985.

[7] N. Hjort. Nonparametric Bayes Estimators Based on Beta Processes in Models for Life History Data, 1990.

[8] M. Escobar. Estimating Normal Means with a Dirichlet Process Prior, 1994.

[9] Jun S. Liu, et al. The Collapsed Gibbs Sampler in Bayesian Computations with Applications to a Gene Regulation Problem, 1994.

[10] J. Pitman. Exchangeable and partially exchangeable random partitions, 1995.

[11] M. Escobar, et al. Bayesian Density Estimation and Inference Using Mixtures, 1995.

[12] Tomaso A. Poggio, et al. Example-Based Learning for View-Based Human Face Detection. IEEE Trans. Pattern Anal. Mach. Intell., 1998.

[13] Pedro Larrañaga, et al. An empirical comparison of four initialization methods for the K-Means algorithm. Pattern Recognit. Lett., 1999.

[14] Thomas L. Griffiths, et al. Infinite latent feature models and the Indian buffet process. NIPS, 2005.

[15] Marc Teboulle, et al. Grouping Multidimensional Data: Recent Advances in Clustering, 2006.

[16] Pavel Berkhin, et al. A Survey of Clustering Data Mining Techniques. In Grouping Multidimensional Data, 2006.

[17] Douglas Steinley, et al. K-means clustering: a half-century synthesis. The British Journal of Mathematical and Statistical Psychology, 2006.

[18] Sergei Vassilvitskii, et al. k-means++: the advantages of careful seeding. SODA, 2007.

[19] Michael I. Jordan, et al. Hierarchical Beta Processes and the Indian Buffet Process. AISTATS, 2007.

[20] Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett., 2008.

[21] C. Thomaz, et al. A new ranking method for principal components analysis and its application to face image analysis. Image Vis. Comput., 2010.

[22] Michael I. Jordan, et al. Beta Processes, Stick-Breaking and Power Laws, 2011. arXiv:1106.0539.

[23] Michael I. Jordan, et al. Revisiting k-means: New Algorithms via Bayesian Nonparametrics. ICML, 2012.

[24] Michael I. Jordan, et al. Feature allocations, probability functions, and paintboxes, 2013. arXiv:1301.6647.

[25] Tamara Broderick, et al. Clusters and Features from Combinatorial Stochastic Processes, 2014.