Correcting Experience Replay for Multi-Agent Communication

We consider the problem of learning to communicate using multi-agent reinforcement learning (MARL). A common approach is to learn off-policy, using data sampled from a replay buffer. However, messages received in the past may not accurately reflect the current communication policy of each agent, which complicates learning. We therefore introduce a 'communication correction' that accounts for the non-stationarity of observed communication induced by multi-agent learning. It works by relabelling the received message so that it is likely under the communicator's current policy, and thus better reflects the receiver's current environment. To account for cases in which agents are both senders and receivers, we introduce an ordered relabelling scheme. Our correction is computationally efficient and can be integrated with a range of off-policy algorithms. It substantially improves the ability of communicating MARL systems to learn across a variety of cooperative and competitive tasks.
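As a rough illustration of the idea, the sketch below relabels the communication stored in a sampled trajectory using each sender's current message policy, stepping forward in time so that recomputed messages can condition on already-relabelled inbound messages (the ordered case where agents both send and receive). The buffer layout, the `message_policies` interface, and all function names are assumptions made for illustration, not the paper's implementation.

```python
from typing import Any, Callable, Dict, List

# Hypothetical trajectory format: each step stores every agent's observation
# under "obs" and the messages broadcast at that step under "messages".
Step = Dict[str, Any]


def relabel_trajectory(
    trajectory: List[Step],
    message_policies: Dict[str, Callable],
) -> List[Step]:
    """Replace stale stored messages with messages the senders' *current*
    policies would emit, so receivers train on communication that reflects
    present behaviour rather than an outdated, non-stationary policy.

    `message_policies[agent_id](obs, inbound_msgs)` is assumed to return the
    message agent_id would send now, given its observation and the messages
    it received at that step.
    """
    relabelled = []
    # Messages entering the first step are taken as stored in the buffer.
    inbound = trajectory[0]["messages"]
    for step in trajectory:
        step = dict(step)  # shallow copy; only the messages are overwritten
        new_messages = {
            agent_id: policy(step["obs"][agent_id], inbound)
            for agent_id, policy in message_policies.items()
        }
        step["messages"] = new_messages
        relabelled.append(step)
        inbound = new_messages  # relabelled messages propagate to the next step
    return relabelled
```

A corrected trajectory produced this way could then be fed to any off-policy learner in place of the raw sample; relabelling only requires a forward pass of each sender's message policy, which keeps the correction computationally cheap.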
