Using OVA modeling to improve classification performance for large datasets

One-Versus-All (OVA) classification is a classifier construction method in which a k-class prediction task is decomposed into k two-class sub-problems. One base model is trained for each sub-problem, and the base models are then combined into a single model for prediction. Aggregate model implementation is the general process of constructing several base models and combining them into one model for prediction; OVA classification is therefore a form of aggregate modeling. This paper reports studies conducted to establish whether OVA classification can provide predictive performance gains when large volumes of data are available for modeling, as is commonly the case in data mining. The paper demonstrates four results. First, OVA modeling can increase the total amount of training data used while keeping each base model's training set much smaller than the full available training set. Second, OVA models created from large datasets achieve higher predictive performance than single k-class models. Third, boosted OVA base models can provide higher predictive performance than un-boosted OVA base models. Fourth, when the algorithm that combines the base model predictions is able to resolve tied predictions, the resulting aggregate models achieve higher predictive performance.
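To make the decomposition concrete, the sketch below trains one binary base model per class and combines the base models' scores at prediction time. This is a minimal illustration, not the paper's implementation: the logistic regression base learners, the Iris dataset, and the argmax-over-probabilities combination rule (which resolves ties implicitly by taking the first maximal class) are all assumptions made for the example.

```python
# Minimal One-Versus-All (OVA) sketch: one 2-class base model per class,
# combined by predicting the class whose base model scores highest.
# Assumptions for illustration only: logistic regression base learners and
# an argmax-over-probabilities combination rule; the paper's base learners,
# sampling scheme, and tie-resolution strategy may differ.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Decompose the k-class task: for each class c, train a base model on the
# binary sub-problem "class c versus all other classes".
base_models = {}
for c in classes:
    binary_target = (y == c).astype(int)
    base_models[c] = LogisticRegression(max_iter=1000).fit(X, binary_target)

# Aggregate the base models: score each example with every base model's
# positive-class probability, then predict the highest-scoring class.
scores = np.column_stack(
    [base_models[c].predict_proba(X)[:, 1] for c in classes]
)
predictions = classes[np.argmax(scores, axis=1)]

print("training accuracy:", np.mean(predictions == y))
```

The same skeleton accommodates the paper's other comparisons: each base learner could be replaced by a boosted ensemble (e.g. scikit-learn's AdaBoostClassifier) to contrast boosted against un-boosted base models, and the argmax rule could be replaced by a combination algorithm with explicit tie handling.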
