Efficient learning by implicit exploration in bandit problems with side observations

We consider online learning problems under a partial observability model that captures situations where the information conveyed to the learner lies between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also observes the losses of some other actions. The revealed losses depend on the learner's action and on a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial-information setting that models online combinatorial optimization problems where the feedback received by the learner lies between semi-bandit and full feedback. As the predictions of our first algorithm cannot always be computed efficiently in this setting, we propose another algorithm with similar properties that is always computationally efficient, at the price of a slightly more involved tuning mechanism. Both algorithms rely on a novel exploration strategy called implicit exploration, which is shown to be more efficient both computationally and information-theoretically than previously studied exploration strategies for the problem.
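
The abstract does not spell out what implicit exploration looks like in practice. As a rough illustrative sketch only (based on the standard Exp3-IX-style construction, not on text from this page; all names and parameters below are assumptions), the idea is to bias the importance-weighted loss estimate by adding a small constant to the observation probability, which keeps the estimates bounded without mixing in explicit uniform exploration:

```python
import numpy as np

def exp3_ix_round(weights, losses, observed, obs_probs, eta, gamma):
    """One round of an Exp3-IX-style update (illustrative sketch, not the paper's exact algorithm).

    weights   : current exponential weights over the K actions
    losses    : losses in [0, 1] for this round
    observed  : boolean mask of actions whose losses were revealed
    obs_probs : probability that each action's loss is observed, given the
                sampling distribution and the observation graph (assumed
                computable here for simplicity)
    eta, gamma: learning rate and implicit-exploration parameter
    """
    probs = weights / weights.sum()  # sampling distribution for this round
    # Implicit exploration: divide by (obs_probs + gamma) instead of obs_probs.
    # The extra gamma biases the estimate slightly downward but keeps it bounded,
    # playing the role that explicit exploration would otherwise play.
    loss_est = np.where(observed, losses / (obs_probs + gamma), 0.0)
    new_weights = weights * np.exp(-eta * loss_est)
    return probs, new_weights
```

In this sketch the only change relative to a plain importance-weighted estimator is the additive gamma in the denominator, which is what makes the exploration "implicit" rather than an explicit mixture with the uniform distribution.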
