A comparative study of sample selection methods for classification

Sampling of large datasets is important for data mining for at least two reasons: processing the full dataset increases computational cost, and that additional cost may not be justifiable, whereas small samples allow data mining algorithms to run quickly and efficiently. This paper discusses statistical methods for obtaining sufficient samples from datasets for classification problems. Results are presented for an empirical study based on sequential random sampling, with sample sufficiency evaluated using univariate hypothesis testing and an information-theoretic measure. Comparisons are made between theoretical and empirical estimates.
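The evaluation scheme named in the abstract can be illustrated with a minimal sketch (not the authors' implementation): a sample is grown sequentially and accepted once a per-feature univariate hypothesis test finds no significant difference from the full dataset and an information-theoretic measure (here, the KL divergence between class distributions) falls below a threshold. The step size, significance level, and threshold values are illustrative assumptions.

```python
# Minimal sketch of sequential random sampling with sample-sufficiency checks.
# The dataset X, labels y, and all thresholds are assumptions for illustration.
import numpy as np
from scipy import stats

def class_distribution(labels, classes):
    counts = np.array([(labels == c).sum() for c in classes], dtype=float)
    return counts / counts.sum()

def sufficient_sample(X, y, step=100, alpha=0.05, kl_threshold=0.01, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # sequential random sampling order
    classes = np.unique(y)
    n = step
    while n < len(X):
        idx = order[:n]
        Xs, ys = X[idx], y[idx]
        # Univariate hypothesis test per feature: sample vs. full dataset.
        features_ok = all(
            stats.ttest_ind(Xs[:, j], X[:, j], equal_var=False).pvalue > alpha
            for j in range(X.shape[1])
        )
        # Information-theoretic check: KL divergence of class distributions.
        kl = stats.entropy(class_distribution(ys, classes),
                           class_distribution(y, classes))
        if features_ok and kl < kl_threshold:
            return idx                        # sample judged sufficient
        n += step
    return order                              # fall back to the full dataset
```

In this sketch a sample passes only when both criteria are met; in practice either criterion could be used alone, and the choice of test statistic and threshold controls how aggressively the sample size is reduced.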
