Crowdsourced Clustering: Querying Edges vs Triangles

We consider the task of clustering items using answers from non-expert crowd workers. In such settings, workers are often unable to label items directly; however, it is reasonable to assume that they can compare items and judge whether they are similar. An important question is which queries to make, and we compare two types: random edge queries, in which a pair of items is revealed, and random triangle queries, in which a triple is. Since querying all possible edges or triangles is far too expensive, we must work with partial observations under a fixed query budget. When a generative model for the data is available (we consider several), we measure the cost of a query by its entropy; when no such model exists, we use the workers' average response time per query as a surrogate for cost. In addition to theoretical justification, through simulations and experiments on two real data sets on Amazon Mechanical Turk, we empirically demonstrate that, for a fixed budget, triangle queries uniformly outperform edge queries. Although triangle queries, unlike edge queries, reveal dependent edges, they yield more reliable edges and, for a fixed budget, many more of them. We also give a sufficient condition, in terms of the number of observations, the edge densities inside and outside the clusters, and the minimum cluster size, for the exact recovery of the true adjacency matrix from triangle queries using a convex optimization-based clustering algorithm.
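To make the entropy-based cost comparison concrete, below is a minimal simulation sketch in the spirit of the abstract, assuming a simple planted-partition (stochastic block model) style generative model: items fall into k equal clusters, a worker labels a pair "similar" with probability p inside a cluster and q across clusters, and each pair answer is drawn independently given the cluster assignments. The parameter values and function names are illustrative assumptions, not taken from the paper, whose actual generative models and analysis may differ.

```python
import math
import random
from collections import Counter

def entropy(counter):
    """Empirical Shannon entropy (bits) of an outcome distribution."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

def simulate(n_items=120, k=4, p=0.8, q=0.1, trials=200_000, seed=0):
    """Estimate the entropy of edge- vs triangle-query answers under a
    planted-partition model (illustrative parameters, not from the paper):
    a worker labels a pair 'similar' w.p. p within a cluster and q across
    clusters, independently per pair given the cluster assignments."""
    rng = random.Random(seed)
    cluster = [i % k for i in range(n_items)]  # k equal ground-truth clusters

    def edge_answer(i, j):
        prob = p if cluster[i] == cluster[j] else q
        return rng.random() < prob

    edge_counts, tri_counts = Counter(), Counter()
    for _ in range(trials):
        # Edge query: one random pair, binary answer.
        i, j = rng.sample(range(n_items), 2)
        edge_counts[edge_answer(i, j)] += 1
        # Triangle query: one random triple, three (dependent) pair answers.
        a, b, c = rng.sample(range(n_items), 3)
        tri_counts[(edge_answer(a, b), edge_answer(a, c), edge_answer(b, c))] += 1

    h_edge, h_tri = entropy(edge_counts), entropy(tri_counts)
    print(f"edge query:     {h_edge:.3f} bits, 1 edge  -> {1 / h_edge:.2f} edges/bit")
    print(f"triangle query: {h_tri:.3f} bits, 3 edges -> {3 / h_tri:.2f} edges/bit")

if __name__ == "__main__":
    simulate()
```

Because the three pair answers within a triangle are correlated through the shared cluster memberships, the triangle answer carries less than three times the entropy of a single edge answer, so it reveals more edges per bit of query cost; this is the intuition behind the fixed-budget advantage of triangle queries described above.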
