Minimax Games with Bandits

One of the earliest online learning games, now commonly known as the hedge setting [Freund and Schapire, 1997], goes as follows. On round $t$, a Learner chooses a distribution $w_t$ over a set of $n$ actions, an Adversary reveals $\ell_t \in [0, 1]^n$, a vector of losses with one entry per action, and the Learner suffers $w_t \cdot \ell_t = \sum_{i=1}^{n} w_{t,i} \ell_{t,i}$. Freund and Schapire [1997] showed that a very simple strategy of exponentially weighting the actions according to their cumulative losses provides a near-optimal guarantee. That is, by setting