Generalization Error of Combined Classifiers

Abstract

We derive an upper bound on the generalization error of classifiers which can be represented as thresholded convex combinations of thresholded convex combinations of functions. Such classifiers include single hidden-layer threshold networks and voted combinations of decision trees (such as those produced by boosting algorithms). The derived bound depends on the proportion of training examples with margin less than some threshold and the average complexity of the combined functions (where the average is over the weights assigned to each function in the convex combination). The complexity of the individual functions in the combination depends on their closeness to threshold. By representing a decision tree as a thresholded convex combination of weighted leaf functions, we apply this result to bound the generalization error of combinations of decision trees. Previous bounds depend on the margin of the combined classifier and the average complexity of the decision trees in the combination, where the complexity of each decision tree depends on the total number of leaves. Our bound also depends on the margin of the combined classifier and the average complexity of the decision trees, but our measure of complexity for an individual decision tree is based on the distribution of training examples over leaves and can be significantly smaller than the total number of leaves.
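To make the abstract's central object concrete, the following is a minimal sketch of the two-level classifier form it describes. The symbols (f, h_i, w_i, v_{ij}, g_{ij}, theta) are chosen here for illustration and are not necessarily the paper's own notation.

```latex
% Illustrative notation only; the paper's own symbols may differ.
% A thresholded convex combination of thresholded convex combinations:
f(x) = \operatorname{sgn}\Bigl(\sum_{i=1}^{N} w_i\, h_i(x)\Bigr),
\qquad
h_i(x) = \operatorname{sgn}\Bigl(\sum_{j} v_{ij}\, g_{ij}(x)\Bigr),
\qquad
w_i,\, v_{ij} \ge 0,\quad \sum_i w_i = \sum_j v_{ij} = 1 .
```

Under this notation, the margin of f on a labelled example (x, y) with y in {-1, +1} is y * sum_i w_i h_i(x), and the bound described in the abstract controls the test error by the fraction of training examples whose margin falls below some threshold theta > 0, plus a term involving the w-weighted average complexity of the combined functions h_i.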
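The decision-tree application can also be illustrated with a short, hedged sketch in Python (using scikit-learn purely for convenience; the paper itself does not use it). The sketch fits a tree, reads off the empirical distribution of training examples over its leaves, and contrasts the raw leaf count with an entropy-based "effective" leaf count. The entropy-based quantity is an illustrative stand-in for the paper's leaf-distribution complexity measure, not its exact definition.

```python
# Hedged sketch: a decision tree viewed through its leaf functions, and an
# "effective number of leaves" computed from the training-example distribution
# over leaves.  The 2**entropy measure below is an illustrative stand-in for
# the paper's leaf-distribution complexity, not its exact definition.
import math
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=32, random_state=0).fit(X, y)

# Conceptually, the tree's output is sgn(sum_l c_l * g_l(x)), a thresholded
# combination of leaf indicator functions g_l (g_l(x) = 1 iff x reaches leaf l)
# with leaf labels c_l.  Here we only need which leaf each example reaches.
leaf_ids = tree.apply(X)                          # leaf index for each example
counts = Counter(leaf_ids)
n = len(X)
p = {leaf: c / n for leaf, c in counts.items()}   # distribution over leaves

total_leaves = tree.get_n_leaves()
entropy = -sum(q * math.log2(q) for q in p.values())
effective_leaves = 2 ** entropy                   # small if a few leaves dominate

print(f"total leaves:     {total_leaves}")
print(f"effective leaves: {effective_leaves:.1f}")
```

If most training examples fall into a few leaves, effective_leaves is much smaller than total_leaves; this is exactly the regime in which the abstract claims the new bound improves on bounds that charge each tree its total number of leaves.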
