Open Problem: Lower bounds for Boosting with Hadamard Matrices

Boosting algorithms can be viewed as a zero-sum game. At each iteration a new column/hypothesis is chosen from a game matrix representing the entire hypothesis class. There are algorithms for which the gap between the value of the sub-matrix (the $t$ columns chosen so far) and the value of the entire game matrix is $O(\sqrt{\log n / t})$. A matching lower bound has been shown for random game matrices for $t$ up to $n^\alpha$, where $\alpha \in (0, \frac{1}{2})$. We conjecture that with Hadamard matrices we can build a certain game matrix for which the game value grows at the slowest possible rate for $t$ up to a fraction of $n$.

1. Boosting as a zero-sum game

Boosting algorithms follow this protocol in each iteration (e.g. Freund and Schapire, 1997; Freund, 1995): the algorithm provides a distribution $d$ on a given set of $n$ examples. Then an oracle provides a "weak hypothesis" from some hypothesis class and the distribution is updated. At the end, the algorithm outputs a convex combination $w$ of the hypotheses it received from the oracle.

One can view Boosting as a zero-sum game between a row and a column player (Freund and Schapire, 1997). Each possible hypothesis provided by the oracle is a column chosen from an underlying game matrix $U$ that represents the entire hypothesis class available to the oracle. The examples correspond to the rows of this matrix. At the end of iteration $t$, the algorithm has received $t$ columns/hypotheses so far, and we use $U_t$ to denote this sub-matrix of $U$. The minimax value of $U_t$ is defined as follows:

$$\mathrm{val}(U_t) \;=\; \min_{d \in S_n} \max_{w \in S_t} d^\top U_t w \;=\; \max_{w \in S_t} \min_{r=1,\ldots,n} [U_t w]_r. \qquad (1)$$

Here $d$ is the distribution on the rows/examples and $w$ represents a convex combination of the $t$ columns of $U_t$. Finally, $[U_t w]_r$ is the margin of row/example $r$ w.r.t. the convex combination $w$ of the current hypothesis set. So in Boosting the value of $U_t$ is the maximum minimum margin over all examples achievable with the current $t$ columns of $U_t$.
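Definition (1) is a linear program over $(w, m)$: maximize the margin $m$ subject to $[U_t w]_r \ge m$ for every row $r$ and $w \in S_t$. The sketch below computes it with SciPy's LP solver (the helper name `game_value` is ours, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def game_value(U):
    """val(U) = max_{w in simplex} min_r [U w]_r, as an LP in variables (w, m)."""
    n, t = U.shape
    # linprog minimizes, so maximize m via objective -m
    c = np.zeros(t + 1)
    c[-1] = -1.0
    # constraints: m - [U w]_r <= 0 for every row r
    A_ub = np.hstack([-U, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # w lies on the simplex S_t: sum_i w_i = 1, w >= 0; m is unbounded
    A_eq = np.hstack([np.ones((1, t)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * t + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    assert res.success
    return res.x[-1]

# Matching pennies has value 0 (optimal w = (1/2, 1/2) gives all margins 0).
U = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(game_value(U))  # ≈ 0.0
```

By LP duality the same program also yields the row player's side, $\min_d \max_w d^\top U_t w$, which is why the two expressions in (1) coincide.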
The value of $U_t$ increases as columns are added, and in this view of Boosting the goal is to raise the value of $U_t$ as quickly as possible to the value of the entire underlying game matrix $U$. There are boosting algorithms that guarantee that after $O(\frac{\log n}{\epsilon^2})$ iterations the gap $\mathrm{val}(U) - \mathrm{val}(U_t)$ is at most $\epsilon$ (Freund and Schapire, 1997; Rätsch and Warmuth, 2005; Warmuth et al., 2008). In other words, the gap at iteration $t$ is at most $O(\sqrt{\log n / t})$.

© 2013 J. Nie, M.K. Warmuth, S. Vishwanathan & X. Zhang.

Here we are interested in finding game matrices with a matching lower bound for the value gap. The lower bound should hold for any boosting algorithm, and therefore the gap in this case is defined as the maximum over all sub-matrices $U_t$ of $t$ columns of $U$:¹

$$\mathrm{gap}_t(U) \;:=\; \mathrm{val}(U) - \max_{U_t} \mathrm{val}(U_t).$$

First notice that the gap is non-zero only when $t \le n$, since for any $n \times m$ ($m > n$) game matrix, its value is always attained by one of its sub-matrices of size $n \times (n+1)$. This follows from Carathéodory's theorem, which implies that for any column player $w \in S_m$ there is a $\hat w$ with support of size at most $n+1$ satisfying $Uw = U\hat w$. So w.l.o.g. $m \le n$. Klein and Young (1999) showed that for a limited range of $t$ ($\log n \le t \le n^\alpha$ with $\alpha \in (0, \frac{1}{2})$), the gap is $\Omega(\sqrt{\log n / t})$ with high probability for random bit matrices $U$.² We claim that with certain game matrices the range of $t$ in this lower bound can be increased.

2. Lower bounds with Hadamard matrices

Hadamard matrices have been used before for proving hardness results in Machine Learning (e.g. Kivinen et al., 1997; Warmuth and Vishwanathan, 2005) and for iteratively constructing game matrices with large gaps (Nemirovski and Yudin, 1983; Ben-Tal et al., 2001). We begin by giving a simple but weak lower bound using these matrices (an adaptation of Proposition 4.2 of Ben-Tal et al. (2001)). Let $n = 2^k$ and let $H$ be the $n \times n$ Hadamard matrix. Define $\hat H$ to be $H$ with its first row removed.
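For $n = 2^k$, the matrices $H$ and $\hat H$ just defined can be built with the standard Sylvester construction; a minimal sketch (the function name `sylvester_hadamard` is ours):

```python
import numpy as np

def sylvester_hadamard(k):
    """Return the 2^k x 2^k Hadamard matrix via the Sylvester construction:
    H_{2m} = [[H_m, H_m], [H_m, -H_m]], starting from H_1 = [1]."""
    H = np.array([[1.0]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

k = 3
n = 2 ** k
H = sylvester_hadamard(k)
Hhat = H[1:, :]  # H with the first (all-ones) row removed

assert np.allclose(H.T @ H, n * np.eye(n))  # orthogonal columns
assert np.allclose(Hhat.sum(axis=1), 0)     # every row of Hhat sums to zero
```

The same matrix is available as `scipy.linalg.hadamard(n)`. The two asserted properties (orthogonal columns of $H$, zero row sums of $\hat H$) are exactly the ones the proof below relies on.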
We use the game matrix $\begin{bmatrix} \hat H \\ -\hat H \end{bmatrix}$ and, for any matrix $U$, let $\mathrm{val}_D(U)$ denote $\mathrm{val}\left(\begin{bmatrix} U \\ -U \end{bmatrix}\right)$. Notice that by definition (1), $\mathrm{val}_D(U) = -\min_{w \in S_n} \|Uw\|_\infty \le 0$.

Theorem  For $1 \le t \le \frac{n}{2}$,
$$\mathrm{val}_D(\hat H) - \max_{\hat H_t} \mathrm{val}_D(\hat H_t) \;\ge\; \sqrt{\tfrac{1}{2t}},$$
where the maximum is over all sub-matrices $\hat H_t$ of $t$ columns of $\hat H$.

Proof  First we show $\mathrm{val}_D(\hat H) = 0$. Notice that $\hat H$ has row sums zero and therefore
$$\mathrm{val}_D(\hat H) = -\min_{w \in S_n} \|\hat H w\|_\infty \ge -\left\|\hat H \tfrac{1}{n}\mathbf{1}\right\|_\infty = 0.$$
Since $H$ has orthogonal columns, we have that for any $\hat H_t$, $\hat H_t^\top \hat H_t = n I_t - \mathbf{1}_t \mathbf{1}_t^\top$ and
$$\min_{w \in S_t} \|\hat H_t w\|_\infty \;\ge\; \min_{w \in S_t} \frac{\|\hat H_t w\|_2}{\sqrt{n-1}} \;=\; \min_{w \in S_t} \sqrt{\frac{w^\top \hat H_t^\top \hat H_t w}{n-1}} \;=\; \min_{w \in S_t} \sqrt{\frac{n}{n-1}\, w^\top w - \frac{1}{n-1}} \;\ge\; \sqrt{\frac{n-t}{(n-1)\,t}},$$
where the last inequality uses $w^\top w \ge \frac{1}{t}$ for $w \in S_t$. For $t \le \frac{n}{2}$ we have $\frac{n-t}{(n-1)t} \ge \frac{1}{2t}$, which completes the proof.

1. Freund (1995) originally gave an adversarial oracle that iteratively produces a hypothesis of error $\frac{1}{2} - \epsilon$ w.r.t. the current distribution, and for any particular algorithm the oracle can make this go on for $\Omega(\frac{\log n}{\epsilon^2})$ iterations. A lower bound of $\Omega(\sqrt{(\log n)/t})$ on the value gap is a much stronger type of lower bound.
2. The same lower bound translates to random $\pm 1$ matrices via shifting and scaling.
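The identities in the proof, the zero row sums of $\hat H$, the Gram identity $\hat H_t^\top \hat H_t = n I_t - \mathbf{1}_t \mathbf{1}_t^\top$, and the norm chain evaluated at the uniform $w$ (which attains $w^\top w = \frac{1}{t}$), can be checked numerically. A sketch assuming NumPy/SciPy, for one particular choice of $\hat H_t$:

```python
import numpy as np
from scipy.linalg import hadamard

n = 16
H = hadamard(n).astype(float)  # Sylvester Hadamard matrix; first row is all ones
Hhat = H[1:, :]                # (n-1) x n, every row sums to zero
assert np.allclose(Hhat.sum(axis=1), 0)

t = 5
Ht = Hhat[:, :t]               # one particular t-column sub-matrix

# Orthogonality of H's columns gives Ht^T Ht = n I_t - 1 1^T.
assert np.allclose(Ht.T @ Ht, n * np.eye(t) - np.ones((t, t)))

# At the uniform w (which minimizes w^T w over the simplex), the middle
# expressions of the proof's chain all equal sqrt((n - t) / ((n - 1) t)).
w = np.full(t, 1.0 / t)
bound = np.sqrt((n - t) / ((n - 1) * t))
assert np.isclose(np.linalg.norm(Ht @ w) / np.sqrt(n - 1), bound)

# The infinity norm dominates it, as the chain's first inequality asserts,
# and for t <= n/2 the bound is at least sqrt(1/(2t)).
assert np.max(np.abs(Ht @ w)) >= bound - 1e-12
assert bound >= np.sqrt(1.0 / (2 * t))
```

This only spot-checks one sub-matrix and one $w$; the proof, of course, covers every $\hat H_t$ and every $w \in S_t$.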