Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning

Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases in order to combine the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we can minimize the number of questions asked to the crowd, allowing crowd-sourced applications to scale (i.e., label much larger datasets at lower costs). Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy to use, even for practitioners who are not machine learning experts. We draw on the theory of the nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements. Our results, on 3 real-world datasets collected with Amazon's Mechanical Turk, and on 15 UCI datasets, show that our methods on average ask 1--2 orders of magnitude fewer questions than the baseline, and 4.5--44× fewer than existing active learning algorithms.
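To make the bootstrap-based selection idea concrete, the following is a minimal sketch, not the paper's actual algorithm: classifiers are trained on bootstrap resamples of the labeled pool, the variance of their predictions over the unlabeled items serves as an uncertainty estimate, and the highest-variance items are the ones sent to the crowd. The function name `select_uncertain`, the use of scikit-learn's `LogisticRegression`, and all parameter choices are illustrative assumptions.

```python
# Illustrative sketch of bootstrap-based active learning (not the paper's
# exact method): variance across bootstrap-trained classifiers approximates
# label uncertainty, and the most uncertain items are routed to the crowd.
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_uncertain(X_labeled, y_labeled, X_unlabeled,
                     n_boot=10, batch=100, seed=0):
    """Return indices of the `batch` unlabeled items (numpy arrays assumed)
    whose predicted positive-class probability varies most across replicas."""
    rng = np.random.default_rng(seed)
    n = len(y_labeled)
    probs = np.empty((n_boot, len(X_unlabeled)))
    for b in range(n_boot):
        # Nonparametric bootstrap: resample the labeled pool with replacement,
        # redrawing if a resample happens to contain only one class
        # (assumes the labeled pool itself contains both classes).
        while True:
            idx = rng.integers(0, n, size=n)
            if len(np.unique(y_labeled[idx])) == 2:
                break
        clf = LogisticRegression(max_iter=1000).fit(X_labeled[idx],
                                                    y_labeled[idx])
        probs[b] = clf.predict_proba(X_unlabeled)[:, 1]
    # High variance across replicas means the model is unsure; ask the crowd.
    uncertainty = probs.var(axis=0)
    return np.argsort(uncertainty)[-batch:]
```

In an iterated labeling loop, the returned indices would be posted as questions to workers (e.g., on Mechanical Turk), the collected answers appended to the labeled pool, and the selection repeated until the budget or a quality target is reached.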
