Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by an O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube, the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this paper is a density bound of $n \binom{n-1}{\leq d-1} / \binom{n}{\leq d} < d$, where $\binom{m}{\leq k}$ denotes the partial binomial sum $\sum_{i=0}^{k} \binom{m}{i}$. This positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme, and also leads to an improved mistake bound for the randomized (respectively, deterministic) one-inclusion strategy for all d (respectively, for d = Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to an O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout.
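The density bound above is easy to sanity-check numerically. The following sketch (not from the paper; the function names are illustrative) evaluates $n \binom{n-1}{\leq d-1} / \binom{n}{\leq d}$ with Python's standard-library `math.comb` and confirms it stays strictly below d over a range of (n, d) pairs:

```python
from math import comb


def binom_leq(m, k):
    """Partial binomial sum C(m, 0) + C(m, 1) + ... + C(m, k)."""
    return sum(comb(m, i) for i in range(k + 1))


def density_bound(n, d):
    """The density bound n * C(n-1, <= d-1) / C(n, <= d),
    claimed to be strictly less than d."""
    return n * binom_leq(n - 1, d - 1) / binom_leq(n, d)


# Verify the strict inequality on small cases.
for n in range(2, 25):
    for d in range(1, n + 1):
        assert density_bound(n, d) < d
```

The inequality follows from the identity $i \binom{n}{i} = n \binom{n-1}{i-1}$: the ratio is a weighted average of the values 0, 1, ..., d, which is strictly below d because the i = 0 term appears in the denominator.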
[1] Leslie G. Valiant, et al., "A general lower bound on the number of examples needed for learning," COLT '88, 1988.
[2] David Haussler, et al., "Sphere Packing Numbers for Subsets of the Boolean n-Cube with Bounded Vapnik-Chervonenkis Dimension," J. Comb. Theory, Ser. A, 1995.
[3] Yi Li, et al., "The one-inclusion graph algorithm is near-optimal for the prediction model of learning," IEEE Trans. Inf. Theory, 2001.
[4] Philip M. Long, et al., "Characterizations of Learnability for Classes of {0, ..., n}-Valued Functions," J. Comput. Syst. Sci., 1995.
[5] Manfred K. Warmuth, et al., "Unlabeled Compression Schemes for Maximum Classes," COLT, 2007.
[6] Norbert Sauer, et al., "On the Density of Families of Sets," J. Comb. Theory, Ser. A, 1972.
[7] David Haussler, et al., "Predicting {0,1}-functions on randomly drawn points," COLT '88, 1988.
[8] Manfred K. Warmuth, et al., "Relating Data Compression and Learnability," 2003.
[9] Shai Ben-David, et al., "Characterizations of learnability for classes of {0, ..., n}-valued functions," COLT '92, 1992.