Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

We present a unifying framework for studying the solution of multiclass categorization problems by reducing them to multiple binary problems that are then solved using a margin-based binary learning algorithm. The proposed framework unifies some of the most popular approaches in which each class is compared against all others, or in which all pairs of classes are compared to each other, or in which output codes with error-correcting properties are used. We propose a general method for combining the classifiers generated on the binary problems, and we prove a general empirical multiclass loss bound given the empirical loss of the individual binary learning algorithms. The scheme and the corresponding bounds apply to many popular classification learning algorithms including support-vector machines, AdaBoost, regression, logistic regression and decision-tree algorithms. We also give a multiclass generalization error analysis for general output codes with AdaBoost as the binary learner. Experimental results with SVM and AdaBoost show that our scheme provides a viable alternative to the most commonly used multiclass algorithms.
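
To make the reduction concrete, the sketch below illustrates the output-code scheme the abstract describes: a coding matrix M in {-1, 0, +1}^(k x L) whose columns define L binary problems, one margin classifier trained per column, and loss-based decoding that assigns an example to the class whose code row incurs the smallest total loss on the binary margins. This is a minimal illustration under stated assumptions, not the paper's exact formulation: scikit-learn's LinearSVC stands in for the margin-based binary learner, the hinge loss max(0, 1 - z) is assumed as the decoding loss, and the function names (one_vs_all_matrix, train_output_code, predict_output_code) are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_all_matrix(k):
    """One-vs-all coding matrix: row r is +1 in column r and -1 elsewhere."""
    return 2.0 * np.eye(k) - 1.0

def train_output_code(X, y, M):
    """Train one binary margin classifier per column of coding matrix M.

    M has shape (k, L) with entries in {-1, 0, +1}: M[r, s] is the binary
    label given to class r in the s-th binary problem, and 0 means class r
    is left out of that problem. y holds integer class indices 0..k-1.
    """
    learners = []
    for s in range(M.shape[1]):
        labels = M[y, s]          # relabel each example through its class's row
        mask = labels != 0        # drop examples whose class this column ignores
        learners.append(LinearSVC().fit(X[mask], labels[mask]))
    return learners

def predict_output_code(X, M, learners):
    """Loss-based decoding: pick the class row with the smallest total loss.

    With the hinge loss L(z) = max(0, 1 - z), example x is assigned to the
    class r minimizing sum_s L(M[r, s] * f_s(x)) over the binary margins f_s.
    """
    F = np.column_stack([clf.decision_function(X) for clf in learners])  # (n, L)
    losses = np.maximum(0.0, 1.0 - F[:, None, :] * M[None, :, :]).sum(axis=2)
    return losses.argmin(axis=1)

# Toy usage: three well-separated Gaussian classes with the one-vs-all code.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
X = rng.normal(size=(300, 2)) + 3.0 * y[:, None]
M = one_vs_all_matrix(3)
learners = train_output_code(X, y, M)
print("training accuracy:", (predict_output_code(X, M, learners) == y).mean())
```

An all-pairs or error-correcting coding matrix plugs into the same two functions unchanged; only M differs, which is the sense in which the framework unifies the one-vs-all, all-pairs, and output-code reductions named above.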
