How to Query an Oracle? Efficient Strategies to Label Data

We consider the basic problem of querying an expert oracle to label a dataset in machine learning. This is typically an expensive and time-consuming process, so we seek ways to do it efficiently. The conventional approach compares each sample with (the representative of) each class until a match is found; with N equally likely classes, this requires N/2 pairwise comparisons (queries) per sample on average. We instead consider a k-ary query scheme in which each query contains k >= 2 samples and the oracle identifies the (dis)similar items in the set, which allows the associated transitive relations to be exploited effectively. We present a randomized batch algorithm that labels the samples on a round-by-round basis and achieves a query rate of O(N/k^2). In addition, we present an adaptive greedy query scheme, which achieves an average rate of approximately 0.2N queries per sample with triplet queries. We investigate the query rate of the proposed algorithms analytically and through simulations. Empirical studies suggest that each triplet query takes an expert at most 50% more time than a pairwise query, indicating the effectiveness of the proposed k-ary query schemes. We generalize the analyses to nonuniform class distributions where possible.
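For intuition, the following is a minimal Python sketch, not the authors' algorithm, that simulates only the baseline pairwise scheme described above: each sample is compared against class representatives in random order until a match is found, which costs roughly N/2 queries per sample for N equally likely classes. The function name pairwise_label_cost and all parameter values are illustrative assumptions.

```python
import random

def pairwise_label_cost(num_classes: int, num_samples: int, rng: random.Random) -> float:
    """Average number of pairwise oracle queries per sample when each sample is
    compared against class representatives in random order until a match is
    found. (Hypothetical baseline simulation; not the paper's k-ary scheme.)"""
    total_queries = 0
    for _ in range(num_samples):
        true_class = rng.randrange(num_classes)   # oracle's ground-truth label
        order = list(range(num_classes))          # representatives to compare against
        rng.shuffle(order)
        for queries, candidate in enumerate(order, start=1):
            if candidate == true_class:
                # If only one class remains untested, its label is implied by the
                # earlier mismatches, so that final comparison needs no query.
                total_queries += min(queries, num_classes - 1)
                break
    return total_queries / num_samples

if __name__ == "__main__":
    rng = random.Random(0)
    N = 20  # number of equally likely classes (illustrative)
    avg = pairwise_label_cost(N, num_samples=10_000, rng=rng)
    print(f"average queries per sample: {avg:.2f} (roughly N/2 = {N / 2:.1f})")
```

Under these assumptions the simulated average lands near N/2, matching the baseline figure quoted in the abstract; the k-ary schemes discussed in the paper reduce this cost by querying k samples jointly and propagating labels through the transitive relations among them.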
