Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning

We consider the networked multi-agent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve joint success. This problem arises in many areas, including traffic control, distributed control, and smart grids. We assume that the reward function of each agent may differ and is observed only locally by the agent itself. Furthermore, each agent is located at a node of a communication network and can exchange information only with its neighbors. Using softmax temporal consistency and a decentralized optimization method, we obtain a principled and data-efficient iterative algorithm. In the first step of each iteration, an agent computes its local policy and value gradients and then updates only its policy parameters. In the second step, the agent propagates messages based on its value function to its neighbors and then updates its own value function; hence we name the algorithm value propagation. We prove a non-asymptotic convergence rate of O(1/T) with nonlinear function approximation. To the best of our knowledge, this is the first MARL algorithm with a convergence guarantee in the control, off-policy, and nonlinear function approximation setting. We empirically demonstrate the effectiveness of our approach.
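To make the two-step iteration concrete, below is a minimal Python sketch of the value-propagation training loop on a small network. The ring topology, the Metropolis-style mixing weights, and the quadratic local losses are all illustrative assumptions made so the sketch runs end to end; the actual algorithm differentiates a softmax-temporal-consistency objective with primal-dual updates.

```python
import numpy as np

# Minimal sketch of the value-propagation loop (toy instantiation).
rng = np.random.default_rng(0)
n_agents, dim = 4, 3

# Communication network: agent i exchanges messages with i-1 and i+1 only
# (ring topology is an assumption for illustration).
neighbors = {i: [(i - 1) % n_agents, (i + 1) % n_agents] for i in range(n_agents)}

# Doubly stochastic mixing matrix supported on the network edges.
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    for j in neighbors[i]:
        W[i, j] = 1.0 / 3.0
    W[i, i] = 1.0 - W[i].sum()

theta = rng.normal(size=(n_agents, dim))    # local policy parameters
w = rng.normal(size=(n_agents, dim))        # local value-function parameters
targets = rng.normal(size=(n_agents, dim))  # stand-in for each agent's local data

lr_pi, lr_v = 0.1, 0.1
for t in range(200):
    # Step 1: each agent computes local policy/value gradients from its
    # privately observed rewards and updates only its policy parameters.
    for i in range(n_agents):
        g_policy = theta[i] - targets[i]     # stand-in local policy gradient
        theta[i] -= lr_pi * g_policy

    # Step 2: each agent sends a message based on its value parameters to
    # its neighbors, then updates its own value function by mixing the
    # received messages with a local value-gradient step.
    mixed = W @ w                            # uses neighbor messages only
    for i in range(n_agents):
        g_value = w[i] - targets[i]          # stand-in local value gradient
        w[i] = mixed[i] - lr_v * g_value

# With a constant stepsize the value parameters reach approximate consensus;
# the residual disagreement shrinks as lr_v is reduced.
print("max value disagreement:", np.max(np.abs(w - w.mean(axis=0))))
```

The neighbor-only mixing in step 2 is the key structural feature: no agent ever accesses another agent's rewards or gradients, only the value messages of its immediate neighbors, which is what keeps the scheme fully decentralized.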
