Learning Near-optimal Convex Combinations of Basis Models with Generalization Guarantees

The problem of learning an optimal convex combination of basis models has been studied in a number of works, with a focus on theoretical analysis but little investigation of the empirical performance of the approach. In this paper, we present new theoretical insights and empirical results that demonstrate the effectiveness of the approach. Theoretically, we first consider whether convex combinations can be replaced by linear combinations while retaining convergence results similar to those known for learning from a convex hull. We present a negative result showing that the linear hull of very simple basis functions can have unbounded capacity and is thus prone to overfitting; convex hulls, in contrast, remain rich but have bounded capacity. In addition, we obtain a generalization bound for a general class of Lipschitz loss functions. Empirically, we discuss how a convex combination can be learned greedily with early stopping, and non-greedily when the number of basis models is known a priori. Our experiments suggest that the greedy scheme is competitive with or better than several baselines, including boosting and random forests. The greedy algorithm requires little hyper-parameter tuning and appears to adapt to the underlying complexity of the problem.
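To make the greedy scheme concrete, the following is a minimal Frank-Wolfe-style sketch (not the paper's exact algorithm) that learns a convex combination over a fixed pool of pre-fitted basis models under squared loss, with early stopping on a held-out validation set. The function name, the 2/(t+2) step-size schedule, and the patience-based stopping rule are illustrative assumptions.

```python
import numpy as np

def greedy_convex_combination(preds_train, y_train, preds_val, y_val,
                              max_rounds=200, patience=10):
    """Greedily learn a convex combination of fixed basis-model predictions.

    At round t, pick the basis model whose predictions best align with the
    negative residual (the linear minimizer over the simplex) and mix it in
    with step size 2 / (t + 2), so the weights stay on the simplex.
    `preds_*` are (n_samples, n_models) arrays of basis predictions.
    """
    n_models = preds_train.shape[1]
    weights = np.zeros(n_models)
    f_train = np.zeros_like(y_train, dtype=float)
    f_val = np.zeros_like(y_val, dtype=float)

    best_val, best_weights, stall = np.inf, weights.copy(), 0
    for t in range(max_rounds):
        residual = f_train - y_train            # gradient of 0.5 * squared loss w.r.t. f
        scores = preds_train.T @ residual       # linearized objective at each simplex vertex
        j = int(np.argmin(scores))              # single basis model minimizing the linearization
        gamma = 2.0 / (t + 2.0)                 # standard Frank-Wolfe step size
        weights *= (1.0 - gamma)
        weights[j] += gamma
        f_train = (1.0 - gamma) * f_train + gamma * preds_train[:, j]
        f_val = (1.0 - gamma) * f_val + gamma * preds_val[:, j]

        val_loss = np.mean((f_val - y_val) ** 2)
        if val_loss < best_val - 1e-8:
            best_val, best_weights, stall = val_loss, weights.copy(), 0
        else:
            stall += 1
            if stall >= patience:               # early stopping on validation loss
                break
    return best_weights
```

In this sketch the basis models are trained once up front and only their predictions enter the optimization, so the number of effective models grows one per round and early stopping controls the capacity of the learned combination.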
