The Importance of Convexity in Learning with Squared Loss

We show that if the closure of a function class F under the metric induced by some probability distribution is not convex, then the sample complexity for agnostically learning F with squared loss (using only hypotheses in F) is Ω(ln(1/δ)/ε²), where 1−δ is the probability of success and ε is the required accuracy. In comparison, if the class F is convex and has finite pseudodimension, then the sample complexity is O((1/ε)(ln(1/ε) + ln(1/δ))). If a nonconvex class F has finite pseudodimension, then the sample complexity for agnostically learning the closure of the convex hull of F is O((1/ε²)(ln(1/ε) + ln(1/δ))). Hence, for agnostic learning, learning the convex hull provides better approximation capabilities with little sample complexity penalty.
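
The three sample-complexity claims above can be stated compactly in display form; the names m_nonconvex, m_convex, and m_co(F) are introduced here only for readability and do not appear in the original abstract:

```latex
% Sample complexity bounds stated in the abstract:
% m = number of examples, accuracy \epsilon, confidence 1-\delta.

% Agnostic learning of a class F whose closure is nonconvex (lower bound):
\[
  m_{\mathrm{nonconvex}}(\epsilon,\delta)
    = \Omega\!\left(\frac{\ln(1/\delta)}{\epsilon^{2}}\right)
\]

% Agnostic learning of a convex F with finite pseudodimension (upper bound):
\[
  m_{\mathrm{convex}}(\epsilon,\delta)
    = O\!\left(\frac{1}{\epsilon}\left(\ln\frac{1}{\epsilon}
        + \ln\frac{1}{\delta}\right)\right)
\]

% Agnostic learning of the closure of the convex hull of a nonconvex F
% with finite pseudodimension (upper bound):
\[
  m_{\mathrm{co}(F)}(\epsilon,\delta)
    = O\!\left(\frac{1}{\epsilon^{2}}\left(\ln\frac{1}{\epsilon}
        + \ln\frac{1}{\delta}\right)\right)
\]
```

The comparison between the first and third bounds is the point of the final sentence: moving from F to its convex hull replaces an unavoidable Ω(ln(1/δ)/ε²) lower bound with an achievable upper bound of the same 1/ε² order, while gaining the hull's richer approximation capability.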
