Provably Efficient Algorithms for Multi-Objective Competitive RL

We study multi-objective reinforcement learning (RL), where an agent’s reward is represented as a vector. In settings where the agent competes against opponents, its performance is measured by the distance from its average return vector to a target set. We develop statistically and computationally efficient algorithms for approaching the associated target set. Our results extend Blackwell’s approachability theorem (Blackwell, 1956) to tabular RL, where strategic exploration becomes essential. The proposed algorithms are adaptive: their guarantees hold even without Blackwell’s approachability condition. When the opponents play fixed policies, we give an improved rate for approaching the target set while also tackling the more ambitious goal of simultaneously minimizing a scalar cost function. We discuss our analysis of this special case by relating our results to prior work on constrained RL. To our knowledge, this work provides the first provably efficient algorithms for vector-valued Markov games, and our theoretical guarantees are near-optimal.
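To make the approachability machinery concrete, below is a minimal sketch of Blackwell's classical dynamics in a repeated matrix game with vector payoffs: whenever the running average payoff lies outside the target set, the agent scalarizes the game along the direction from the average back to its projection onto the set, and plays the maximin strategy of that scalar zero-sum game. The payoff tensor, the target set (the nonpositive orthant), and the random opponent are hypothetical choices made for illustration; the paper's contribution is extending such dynamics to tabular Markov games, where strategic exploration is also required, which this sketch does not attempt.

```python
# A minimal sketch of Blackwell's approachability dynamics in a repeated
# matrix game with vector-valued payoffs. The payoff tensor U, the target
# set S (the nonpositive orthant), and the random opponent are hypothetical
# choices for illustration; this is the classical repeated-game setting,
# not the paper's Markov-game algorithms.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# U[a, b] is the payoff vector in R^d when the agent plays a, opponent plays b.
U = np.array([[[ 1.0, -1.0], [-1.0,  1.0]],
              [[-1.0,  1.0], [ 1.0, -1.0]]])   # shape (A, B, d)
A, B, d = U.shape

def project(v):
    """Euclidean projection onto S = {v : v <= 0} is a coordinate-wise clip."""
    return np.minimum(v, 0.0)

def maximin(G):
    """Mixed strategy x maximizing min_b x^T G[:, b], via a small LP."""
    # Variables (x_1..x_A, z): maximize z subject to x^T G[:, b] >= z for
    # all b, with x in the simplex; linprog minimizes, so the objective is -z.
    c = np.concatenate([np.zeros(A), [-1.0]])
    A_ub = np.hstack([-G.T, np.ones((B, 1))])       # z - x^T G[:, b] <= 0
    A_eq = np.concatenate([np.ones(A), [0.0]])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(B), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * A + [(None, None)])
    x = np.clip(res.x[:A], 0.0, None)
    return x / x.sum()                               # guard against round-off

avg = np.zeros(d)                                    # running average payoff
for t in range(1, 2001):
    gap = avg - project(avg)
    if np.linalg.norm(gap) < 1e-12:
        x = np.full(A, 1.0 / A)                      # inside S: anything works
    else:
        # Blackwell's step: scalarize along the direction from the average
        # back to its projection onto S, and force the expected payoff to
        # the S-side of the separating hyperplane via the maximin strategy.
        x = maximin(U @ (-gap))                      # (A, B) scalarized game
    a = rng.choice(A, p=x)
    b = rng.integers(B)                              # arbitrary (here: random) opponent
    avg += (U[a, b] - avg) / t

# Under Blackwell's condition this distance decays at rate O(1/sqrt(T)).
print("distance to target set:", np.linalg.norm(avg - project(avg)))
```

Note that the only game-solving primitive the sketch needs is a scalar zero-sum solver (the small LP above); in the Markov-game setting the analogous step would be solving a scalarized zero-sum Markov game, which is where the statistical and exploration difficulties addressed by the paper arise.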

References

[1] Benjamin Van Roy et al. Near-optimal Reinforcement Learning in Factored MDPs. NIPS, 2014.

[2] Ness B. Shroff et al. Learning in Markov Decision Processes under Constraints. arXiv, 2020.

[3] Max Simchowitz et al. Constrained episodic reinforcement learning in concave-convex and knapsack settings. NeurIPS, 2020.

[4] Chi Jin et al. Near-Optimal Reinforcement Learning with Self-Play. NeurIPS, 2020.

[5] Suvrit Sra et al. Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes. NeurIPS, 2020.

[6] Shie Mannor et al. Exploration-Exploitation in Constrained MDPs. arXiv, 2020.

[7] Michal Valko et al. Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited. ALT, 2021.

[8] Michael L. Littman. Markov Games as a Framework for Multi-Agent Reinforcement Learning. ICML, 1994.

[9] David Simchi-Levi et al. Non-Stationary Reinforcement Learning: The Blessing of (More) Optimism, 2019.

[10] Tiancheng Yu et al. Provably Efficient Online Agnostic Learning in Markov Games. arXiv, 2020.

[11] Massimiliano Pontil et al. Empirical Bernstein Bounds and Sample-Variance Penalization. COLT, 2009.

[12] Toru Maruyama. On Some Developments in Convex Analysis (in Japanese), 1977.

[13] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 1956.

[14] Xiaohan Wei et al. Provably Efficient Safe Exploration via Primal-Dual Policy Optimization. AISTATS, 2021.

[15] Rémi Munos et al. Minimax Regret Bounds for Reinforcement Learning. ICML, 2017.

[16] Qinghua Liu et al. A Sharp Analysis of Model-based Reinforcement Learning with Self-Play. ICML, 2020.

[17] Chi Jin et al. Provable Self-Play Algorithms for Competitive Reinforcement Learning. ICML, 2020.

[18] J. von Neumann. Zur Theorie der Gesellschaftsspiele [On the Theory of Games of Strategy]. Mathematische Annalen, 1928.

[19] Peter L. Bartlett et al. Blackwell Approachability and No-Regret Learning are Equivalent. COLT, 2010.

[20] Lihong Li et al. Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL. ICLR, 2020.

[21] Aleksandrs Slivkins et al. Bandits with Knapsacks. FOCS, 2013.

[22] Ronen I. Brafman et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. JMLR, 2001.

[23] Shie Mannor et al. Approachability in unknown games: Online learning meets multi-objective optimization. COLT, 2014.

[24] Jianjun Yuan et al. Online Convex Optimization for Cumulative Constraints. NeurIPS, 2018.

[25] Nikhil R. Devanur et al. Bandits with concave rewards and convex knapsacks. EC, 2014.

[26] Akshay Krishnamurthy et al. Reward-Free Exploration for Reinforcement Learning. ICML, 2020.

[27] Cédric Archambeau et al. Adaptive Algorithms for Online Convex Optimization with Long-term Constraints. ICML, 2015.

[28] Nahum Shimkin. An Online Convex Optimization Approach to Blackwell's Approachability. JMLR, 2015.

[29] L. S. Shapley. Stochastic Games. Proceedings of the National Academy of Sciences, 1953.

[30] Michael I. Jordan et al. Is Q-learning Provably Efficient? NeurIPS, 2018.

[31] Qiaomin Xie et al. Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium. COLT, 2020.

[32] Lin F. Yang et al. Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning. NeurIPS, 2020.

[33] Miroslav Dudík et al. Reinforcement Learning with Convex Constraints. NeurIPS, 2019.