Learning Two-Player Mixture Markov Games: Kernel Function Approximation and Correlated Equilibrium

We consider learning Nash equilibria in two-player zero-sum Markov games with nonlinear function approximation, where the action-value function is approximated by a function in a Reproducing Kernel Hilbert Space (RKHS). The key challenge is how to perform exploration in the high-dimensional function space. We propose a novel online learning algorithm that finds a Nash equilibrium by minimizing the duality gap. At the core of our algorithm are upper and lower confidence bounds derived from the principle of optimism in the face of uncertainty. We prove that our algorithm attains an O(√T) regret with polynomial computational complexity, under very mild assumptions on the reward function and the underlying dynamics of the Markov game. We also propose several extensions of our algorithm, including an algorithm with a Bernstein-type bonus that achieves a tighter regret bound, and another algorithm that handles model misspecification and can be applied to neural function approximation.
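
To make the optimism principle concrete, here is a minimal sketch of how kernel-based upper and lower confidence bounds of the kind described above can be computed, using kernel ridge regression with a GP-UCB-style bonus. The RBF kernel, the regularizer `lam`, the bonus multiplier `beta`, and the value range `[0, H]` are illustrative assumptions for this sketch, not the paper's exact construction.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    # Pairwise squared distances between rows of X and rows of Y.
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * bandwidth**2))

def kernel_confidence_bounds(Z, y, z_query, lam=1.0, beta=1.0, H=1.0):
    """Kernel-ridge estimate of Q at z_query, with optimism/pessimism bonuses.

    Z: (n, d) past state-action features; y: (n,) regression targets
    (e.g. reward plus next-step value); z_query: (m, d) query points.
    Returns (ucb, lcb), both clipped to the value range [0, H].
    """
    K = rbf_kernel(Z, Z)                       # (n, n) Gram matrix
    k_q = rbf_kernel(Z, z_query)               # (n, m) cross-kernel
    A = K + lam * np.eye(len(Z))
    alpha = np.linalg.solve(A, y)              # kernel ridge weights
    mean = k_q.T @ alpha                       # point estimate of Q
    # Predictive std: sqrt(k(z, z) - k_Z(z)^T (K + lam I)^{-1} k_Z(z)),
    # with k(z, z) = 1 for the RBF kernel.
    var = 1.0 - np.sum(k_q * np.linalg.solve(A, k_q), axis=0)
    sigma = np.sqrt(np.maximum(var, 0.0))
    ucb = np.clip(mean + beta * sigma, 0.0, H)  # optimistic bound (max-player)
    lcb = np.clip(mean - beta * sigma, 0.0, H)  # pessimistic bound (min-player)
    return ucb, lcb
```

In an optimistic self-play scheme of this flavor, the max-player acts according to an equilibrium of the UCB estimate while the min-player uses the LCB estimate; the gap between the two value estimates (the duality gap) both drives exploration toward poorly-covered state-action pairs and upper-bounds the suboptimality of the returned policy pair.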
