Ensembles for supervised classification learning

This dissertation studies the use of multiple classifiers (ensembles or committees) in learning tasks, addressing both theoretical and practical aspects of combining classifiers. We first analyze the representational ability of voting ensembles. A voting ensemble may perform either better or worse than each of its individual members; we give tight upper and lower bounds on the classification performance of a voting ensemble as a function of the classification performances of its members.

Boosting is a method of combining multiple "weak" classifiers to form a "strong" classifier. Several issues concerning boosting are studied in this thesis. We study SBA, a hierarchical boosting algorithm proposed by Schapire, in terms of its representation and its search, and show that if the weak learner has low representational complexity, SBA's search may fail to boost or may yield a sub-optimal solution. We also present a rejection boosting algorithm that trades off exploration and exploitation: it requires fewer pattern labels at the expense of lower boosting ability.

Ensembles can also be used to gain information. We study their use to minimize data-labeling costs and to improve performance over time, and present a model for on-site learning for this purpose. The system learns by querying "hard" patterns while classifying "easy" ones. This model is related to query-based filtering methods, but takes into account that, in addition to labeling, filtering through the data has a cost. The Query-By-Committee algorithm is used as a good approximation of the model space for real-world domains. Results on a synthesized problem and a real-world OCR task, using both a back-propagation network and a nearest-neighbor classifier, show that an on-site learner can perform as well as a classifier trained off-site while achieving significant cost reduction.
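To make the voting scheme concrete, here is a minimal sketch of plurality (majority) voting over a committee of classifiers; the member classifiers and the threshold rules in the usage example are purely hypothetical and are not taken from the thesis.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine member predictions by plurality vote.

    `classifiers` is any sequence of callables mapping a pattern x to a
    class label; ties are broken in favor of the first label counted.
    """
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical members: three threshold rules on a 1-D pattern.
members = [lambda x: int(x > 0.3), lambda x: int(x > 0.5), lambda x: int(x > 0.7)]
print(majority_vote(members, 0.6))  # two of three members vote 1 -> ensemble predicts 1
```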
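The thesis's SBA analysis and rejection boosting algorithm are not spelled out in this abstract, so as a hedged illustration of what a boosting loop looks like in general, here is a sketch of a standard AdaBoost-style procedure (not the thesis's algorithms); the weak-learner interface `train_weak(X, y, w)` is an assumption made for the example.

```python
import math

def adaboost(train_weak, X, y, rounds=10):
    """AdaBoost-style loop: reweight patterns so that later weak learners
    concentrate on the examples the current ensemble misclassifies.

    `train_weak(X, y, w)` is assumed to return a callable weak classifier h
    with h(x) in {-1, +1}; labels y are assumed to be in {-1, +1}.
    """
    n = len(X)
    w = [1.0 / n] * n                       # uniform initial pattern weights
    ensemble = []                           # list of (alpha, weak hypothesis)
    for _ in range(rounds):
        h = train_weak(X, y, w)
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)
        if err >= 0.5:                      # weak learner no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Increase weight on misclassified patterns, decrease it on correct ones.
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]

    def strong(x):                          # weighted vote of the weak hypotheses
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return strong
```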
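The on-site idea of querying "hard" patterns while classifying "easy" ones can be sketched as one step of a committee-based selective-sampling loop, loosely in the spirit of Query-By-Committee; the `oracle` and `training_set` names are illustrative assumptions, not the thesis's interface.

```python
def on_site_step(committee, x, oracle, training_set):
    """Classify or query a single pattern with a committee of classifiers.

    `committee` is a list of callables, `oracle(x)` returns the true label
    at some labeling cost, and `training_set` accumulates queried examples
    for later retraining of the committee members.
    """
    predictions = [clf(x) for clf in committee]
    if len(set(predictions)) > 1:           # disagreement: a "hard" pattern, query it
        label = oracle(x)
        training_set.append((x, label))     # stored for retraining the committee
        return label, True                  # True = a labeling cost was paid
    return predictions[0], False            # agreement: classify without querying
```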
