Near Optimal Stratified Sampling

The performance of a machine learning system is usually evaluated on i.i.d. observations with true labels. However, acquiring ground-truth labels is expensive, while obtaining unlabeled samples may be much cheaper. Stratified sampling can be beneficial in such settings: it reduces the number of true labels required without compromising evaluation accuracy. Stratified sampling exploits statistical properties (e.g., variance) across strata of the unlabeled population, though usually under the unrealistic assumption that these properties are known in advance. We propose two new algorithms that simultaneously estimate these properties and optimize the evaluation accuracy. We construct a matching lower bound showing that the proposed algorithms are rate optimal up to logarithmic factors. Experiments on synthetic and real data demonstrate the reduction in label complexity enabled by our algorithms.
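To make the idea concrete, the following is a minimal sketch (not the paper's algorithm) of classical stratified evaluation with variance-based (Neyman) allocation, which is the known-variance baseline the abstract refers to. All function names are illustrative, and the per-stratum standard deviations are assumed known here, which is exactly the unrealistic assumption the proposed algorithms remove.

```python
import numpy as np

def neyman_allocation(weights, stds, budget):
    """Split a labeling budget across strata proportionally to w_k * sigma_k
    (Neyman allocation), which minimizes the variance of the stratified mean
    estimator when per-stratum standard deviations sigma_k are known."""
    scores = weights * stds
    alloc = np.floor(budget * scores / scores.sum()).astype(int)
    # Distribute any labels left over from flooring to the highest-score strata.
    for i in np.argsort(-scores)[: budget - alloc.sum()]:
        alloc[i] += 1
    # Keep every stratum represented (may slightly exceed the budget; a sketch).
    return np.maximum(alloc, 1)

def stratified_estimate(strata, weights, alloc, rng):
    """Estimate a population mean (e.g., classifier accuracy) by labeling
    alloc[k] points drawn uniformly from stratum k and reweighting by w_k."""
    means = [rng.choice(s, size=n, replace=True).mean()
             for s, n in zip(strata, alloc)]
    return float(np.dot(weights, means))
```

High-variance strata receive more of the budget, which is why stratified sampling can match the accuracy of uniform sampling with far fewer labels; the paper's contribution is achieving this without knowing the `stds` beforehand.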
