Solving Continuous Control via Q-learning

While there has been substantial success in applying actor-critic methods to continuous control, simpler critic-only methods such as Q-learning often remain intractable in the associated high-dimensional action spaces. However, most actor-critic methods come at the cost of added complexity: heuristics for stabilisation, increased compute requirements, and wider hyperparameter search spaces. We show that these issues can be largely alleviated via Q-learning by combining action discretization with value decomposition, framing single-agent control as cooperative multi-agent reinforcement learning (MARL). With bang-bang actions, the performance of this critic-only approach matches that of state-of-the-art continuous actor-critic methods when learning from either features or pixels. We extend classical bandit examples from cooperative MARL to provide intuition for how decoupled critics leverage state information to coordinate joint optimization, and demonstrate surprisingly strong performance across a wide variety of continuous control tasks.

Figure 1: Q-learning yields state-of-the-art performance on various continuous control benchmarks. Simply combining bang-bang action discretization with full value decomposition scales to high-dimensional control tasks and recovers performance competitive with recent actor-critic methods. Our Decoupled Q-Networks (DecQN) thereby constitute a concise baseline agent that highlights the power of simplicity and helps put recent advances in learning continuous control into perspective.
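To make the decomposition concrete, below is a minimal PyTorch sketch of a decoupled critic in the spirit of DecQN. It is an illustration under stated assumptions, not the paper's exact implementation: the network sizes, variable names, and dummy data are hypothetical, and the joint value is taken here as the mean of per-dimension utilities. The key point it demonstrates is that with one utility head per action dimension, greedy maximisation over the exponentially large joint action space factorizes into an independent argmax per dimension, and bang-bang control corresponds to just two bins per dimension.

```python
import torch
import torch.nn as nn

class DecQN(nn.Module):
    """Decoupled Q-Network sketch: one utility head per action dimension.

    Joint value is assumed to be the mean of per-dimension utilities:
        Q(s, a) = (1/D) * sum_i Q_i(s, a_i)
    """

    def __init__(self, obs_dim: int, action_dims: int, bins: int = 2, hidden: int = 256):
        super().__init__()
        self.action_dims, self.bins = action_dims, bins
        self.torso = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One linear layer emits a utility for every (dimension, bin) pair.
        self.head = nn.Linear(hidden, action_dims * bins)

    def utilities(self, obs: torch.Tensor) -> torch.Tensor:
        # Returns per-dimension utilities of shape (batch, action_dims, bins).
        return self.head(self.torso(obs)).view(-1, self.action_dims, self.bins)

    def q_value(self, obs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # actions: (batch, action_dims) integer bin indices.
        util = self.utilities(obs)
        chosen = util.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
        return chosen.mean(-1)  # mean over dimensions -> joint Q(s, a)

    def greedy_action(self, obs: torch.Tensor) -> torch.Tensor:
        # The argmax over 2^D joint actions reduces to a per-dimension argmax.
        return self.utilities(obs).argmax(-1)

# Usage sketch with dummy data (all dimensions hypothetical).
obs_dim, action_dims, batch, gamma = 24, 6, 32, 0.99
net, target_net = DecQN(obs_dim, action_dims), DecQN(obs_dim, action_dims)
target_net.load_state_dict(net.state_dict())

obs = torch.randn(batch, obs_dim)
actions = net.greedy_action(obs)            # (batch, action_dims) bin indices
torque = 2.0 * actions.float() - 1.0        # bang-bang mapping: bin 0 -> -1, bin 1 -> +1

next_obs, reward = torch.randn(batch, obs_dim), torch.randn(batch)
with torch.no_grad():
    # Decoupled TD target: y = r + gamma * mean_i max_b Q_i(s', b),
    # so no search over the joint action space is ever needed.
    next_util = target_net.utilities(next_obs)
    y = reward + gamma * next_util.max(-1).values.mean(-1)

loss = nn.functional.mse_loss(net.q_value(obs, actions), y)
loss.backward()
```

Because both action selection and the bootstrap target decompose per dimension, the cost of maximisation grows linearly rather than exponentially in the number of action dimensions, which is what lets this critic-only recipe scale to high-dimensional control.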
