Individual Regret in Cooperative Nonstochastic Multi-Armed Bandits

We study agents communicating over an underlying network by exchanging messages, in order to optimize their individual regret in a common nonstochastic multi-armed bandit problem. We derive regret minimization algorithms that guarantee for each agent $v$ an individual expected regret of $\widetilde{O}\left(\sqrt{\left(1+\frac{K}{\left|\mathcal{N}\left(v\right)\right|}\right)T}\right)$, where $T$ is the number of time steps, $K$ is the number of actions and $\mathcal{N}\left(v\right)$ is the set of neighbors of agent $v$ in the communication graph. We present algorithms both for the case that the communication graph is known to all the agents, and for the case that the graph is unknown. When the graph is unknown, each agent knows only the set of its neighbors and an upper bound on the total number of agents. The individual regret between the models differs only by a logarithmic factor. Our work resolves an open problem from [Cesa-Bianchi et al., 2019b].

[1]  Andreas Krause,et al.  Multi-Player Bandits: The Adversarial Case , 2019, J. Mach. Learn. Res..

[2]  Claudio Gentile,et al.  Delay and Cooperation in Nonstochastic Bandits , 2016, COLT.

[3]  Peter Auer,et al.  The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..

[4]  Gábor Lugosi,et al.  Prediction, learning, and games , 2006 .

[5]  Amir Leshem,et al.  Distributed Multi-Player Bandits - a Game of Thrones Approach , 2018, NeurIPS.

[6]  Vaibhav Srivastava,et al.  On distributed cooperative decision-making in multiarmed bandits , 2015, 2016 European Control Conference (ECC).

[7]  Nicolò Cesa-Bianchi,et al.  Cooperative Online Learning: Keeping your Neighbors Updated , 2019, ALT.

[8]  Noga Alon,et al.  A Fast and Simple Randomized Parallel Algorithm for the Maximal Independent Set Problem , 1985, J. Algorithms.

[9]  Anit Kumar Sahu,et al.  Dist-Hedge: A partial information setting based distributed non-stochastic sequence prediction algorithm , 2017, 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[10]  Michael Luby,et al.  A simple parallel algorithm for the maximal independent set problem , 1985, STOC '85.

[11]  Koby Crammer,et al.  Prediction with Limited Advice and Multiarmed Bandits with Paid Observations , 2014, ICML.

[12]  Ohad Shamir,et al.  Multi-player bandits: a musical chairs approach , 2016, ICML 2016.

[13]  István Hegedüs,et al.  Gossip-based distributed stochastic bandit algorithms , 2013, ICML.

[14]  Vaibhav Srivastava,et al.  Distributed cooperative decision-making in multiarmed bandits: Frequentist and Bayesian algorithms , 2016, 2016 IEEE 55th Conference on Decision and Control (CDC).

[15]  Baruch Awerbuch,et al.  Competitive collaborative learning , 2005, J. Comput. Syst. Sci..

[16]  H. Vincent Poor,et al.  Bandit problems in networks: Asymptotically efficient distributed allocation rules , 2011, IEEE Conference on Decision and Control and European Control Conference.

[17]  Aditya Gopalan,et al.  Collaborative learning of stochastic bandits over a social network , 2016, 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[18]  Sébastien Bubeck,et al.  Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems , 2012, Found. Trends Mach. Learn..

[19]  Shie Mannor,et al.  Concurrent Bandits and Cognitive Radio Networks , 2014, ECML/PKDD.