Rate of approximation results motivated by robust neural network learning

Leonid Gurvits
Learning Systems Dept.
Siemens Corp. Research
755 College Road East
Princeton, NJ 08540
gurvits@scr.siemens.com

Eduardo Sontag
Dept. of Mathematics
Rutgers University
New Brunswick, NJ 08903
sontag@control.rutgers.edu

The set of functions which a single hidden layer neural network can approximate is increasingly well understood, yet our knowledge of how the approximation error depends upon the number of hidden units, i.e. the rate of approximation, remains relatively primitive. Barron [1991] and Jones [1992] give bounds on the rate of approximation valid for Hilbert spaces. We derive bounds for $L^p$ spaces, $1 < p < \infty$, recovering the $O(1/\sqrt{n})$ bounds of Barron and Jones for the case $p = 2$. The results were motivated in part by the desire to understand approximation in the more "robust" (resistant to exemplar noise) $L^p$, $1 \le p < 2$, norms.

Consider the task of approximating a given target function $f$ by a linear combination of $n$ functions from a set $S$. For example, $S$ may be the set of possible sigmoidal activation functions, $\{g : \mathbb{R}^d \to \mathbb{R} \mid \exists\, a \in \mathbb{R}^d,\ b \in \mathbb{R},\ \text{s.t. } g(x) = \sigma(a \cdot x + b)\}$, in which case the approximants are single hidden layer neural networks with a linear output layer. It is known that under very weak conditions on $\sigma$ (it must be Riemann integrable and nonpolynomial), the linear span of $S$ is dense in the set of continuous functions on compact subsets of $\mathbb{R}^d$ (i.e. for every positive $\epsilon$ there is a linear combination of functions in $S$ which approximates any given continuous function to within $\epsilon$ everywhere on a compact domain) [Leshno et al. 1992].

Consider the important rate of approximation issue, i.e. the rate at which the achievable error decreases as we allow larger subsets of $S$ to be used in constructing the approximant. In the context of neural networks, this is the question of how the approximation error scales with the number of hidden units in the network. Unfortunately, approximation bounds for target functions $f$ arbitrarily located in the linear closure (i.e. the closure of the span) of $S$ are unknown. However, progress has been made recently by introducing the assumptions that $f$ is in the convex closure of $S$, and that $S$ is bounded in the relevant norm. This theory depends neither on the continuity of $f$ nor on the form of the functions in $S$ (i.e. the functions in $S$ need not be sigmoidal or obey the constraints on $\sigma$ listed above), but only on the properties of the function space and some generic properties of $S$.

Definition 1. Let $X$ be a Banach space with norm $\|\cdot\|$. Let $S \subseteq X$ and $f \in X$. Define

$$\|\mathrm{lin}_n S - f\| := \inf \Big\| \sum_{i=1}^{n} a_i g_i - f \Big\|, \qquad (1)$$

where the infimum is over all $g_1, \ldots, g_n \in S$ and $a_1, \ldots, a_n \in \mathbb{R}$. Also define

$$\|\mathrm{co}_n S - f\| := \inf \Big\| \sum_{i=1}^{n} a_i g_i - f \Big\|, \qquad (2)$$

where the infimum is over all $g_1, \ldots, g_n \in S$ and $a_1, \ldots, a_n \in \mathbb{R}^+ \cup \{0\}$ such that $\sum_i a_i = 1$. That is, $\|\mathrm{lin}_n S - f\|$ is the distance of $f$ from the closest span of $n$ functions from $S$ (linear approximation bound), and $\|\mathrm{co}_n S - f\|$ is the distance of $f$ from the closest convex hull of $n$ functions from $S$ (convex approximation bound).
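To make Definition 1 concrete, the following minimal numerical sketch (not from the paper) fixes a particular choice of $g_1, \ldots, g_n$ drawn from a small dictionary of tanh ridge functions on a discretized interval, and compares the best linear coefficients (unconstrained least squares) with the best convex coefficients (estimated by projected gradient descent onto the simplex). The dictionary, target function, and optimization routine are all illustrative assumptions; the true quantities in (1) and (2) also take the infimum over the choice of the $g_i$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretized domain; the norm below is an empirical L2 norm on this grid.
x = np.linspace(-1.0, 1.0, 500)

def l2(v):
    return np.sqrt(np.mean(v ** 2))

# A small dictionary of sigmoidal ridge functions g_i(x) = tanh(a_i x + b_i)  (d = 1).
n = 10
A = rng.normal(size=n)
B = rng.normal(size=n)
G = np.tanh(np.outer(x, A) + B)          # shape (len(x), n), one column per g_i

# Toy target function (not necessarily in the span or convex hull of the g_i).
f = 0.3 * np.tanh(2.0 * x - 0.5) - 0.2 * np.tanh(-x + 0.3)

# Best *linear* coefficients for this particular choice of g_1, ..., g_n.
a_lin, *_ = np.linalg.lstsq(G, f, rcond=None)
err_lin = l2(G @ a_lin - f)

def project_simplex(v):
    """Euclidean projection of v onto {a : a_i >= 0, sum_i a_i = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[k] / (k + 1), 0.0)

# Best *convex* coefficients for the same g_i, via projected gradient descent.
lips = 2.0 * np.linalg.norm(G, ord=2) ** 2 / len(x)   # Lipschitz constant of the gradient
a_con = np.full(n, 1.0 / n)
for _ in range(5000):
    grad = 2.0 * G.T @ (G @ a_con - f) / len(x)
    a_con = project_simplex(a_con - grad / lips)
err_con = l2(G @ a_con - f)

print(f"linear approximation error : {err_lin:.4f}")
print(f"convex approximation error : {err_con:.4f}   (never smaller than the linear one)")
```

Since every convex combination is in particular a linear combination, the convex error reported for a fixed set of $g_i$ can never fall below the linear one, which is the inequality noted next.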
Note that $\|\mathrm{lin}_n S - f\| \le \|\mathrm{co}_n S - f\|$. These bounds converge to zero as $n \to \infty$ for approximable $f$ and thus represent the convergence rates of the best approximants to the target function. The study of such rates is standard in approximation theory (e.g. [Powell 1981]), but the cases of interest for neural networks are not among those classically considered. For spaces of square-integrable functions (or more general Hilbert spaces) and bounded sets $S$, Barron [1991] presented results at this conference to the effect that $\|\mathrm{co}_n S - f\|_2 = O(1/\sqrt{n})$. Subsequently, under additional conditions on $S$, he has shown that the same rate obtains for the uniform norm [Barron 1992]. If we consider the procedure of constructing approximants to $f$ incrementally, by forming a convex combination of the last approximant with a single new element of $S$, the convergence rate in $L^2$ is, interestingly, again $O(1/\sqrt{n})$.
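The incremental construction just described can be sketched numerically as follows (a hedged illustration, not the paper's argument): the target is taken to be a convex combination of many elements of a finite tanh dictionary, and each step forms $f_n = (1 - \alpha) f_{n-1} + \alpha g$, greedily choosing $g \in S$ and the mixing weight $\alpha \in [0, 1]$ by an exact line search. The dictionary size, target construction, and greedy selection rule are assumptions made for the toy example; printing the error alongside $1/\sqrt{n}$ merely illustrates the kind of decay the $O(1/\sqrt{n})$ result describes.

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.linspace(-1.0, 1.0, 500)

def l2(v):
    return np.sqrt(np.mean(v ** 2))

# Finite dictionary S of bounded sigmoidal ridge functions (d = 1), one per column.
m = 200
A = rng.normal(scale=3.0, size=m)
B = rng.normal(size=m)
G = np.tanh(np.outer(x, A) + B)

# Target in the convex hull of S: a random convex combination of dictionary elements.
w = rng.random(m)
w /= w.sum()
f = G @ w

# Step 1: the best single element of S.
errs = [l2(G[:, j] - f) for j in range(m)]
approx = G[:, int(np.argmin(errs))].copy()
print(f"n =   1   L2 error = {min(errs):.4f}   1/sqrt(n) = 1.0000")

# Steps n >= 2: f_n = (1 - alpha) f_{n-1} + alpha g, greedily choosing g in S
# and the mixing weight alpha in [0, 1] by an exact line search.
for n in range(2, 33):
    best_err, best_cand = np.inf, None
    for j in range(m):
        d = G[:, j] - approx
        denom = float(np.dot(d, d))
        alpha = 0.0 if denom == 0.0 else float(np.clip(np.dot(f - approx, d) / denom, 0.0, 1.0))
        cand = approx + alpha * d
        err = l2(cand - f)
        if err < best_err:
            best_err, best_cand = err, cand
    approx = best_cand
    if n in (2, 4, 8, 16, 32):
        print(f"n = {n:3d}   L2 error = {best_err:.4f}   1/sqrt(n) = {1.0 / np.sqrt(n):.4f}")
```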