Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

We consider the problem of learning in episodic finite-horizon Markov decision processes with unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves Õ(L|X|²√(|A|T)) regret with high probability, where L is the horizon, |X| is the number of states, |A| is the number of actions, and T is the number of episodes. To the best of our knowledge, ours is the first algorithm to ensure Õ(√T) regret in this challenging setting. Our key technical contribution is to introduce an optimistic loss estimator that is inversely weighted by an upper occupancy bound.
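To make the last point concrete, here is a minimal sketch of what such an optimistic, inverse-upper-occupancy-weighted loss estimator typically looks like; the symbols u_t, γ, and the indicator below are our own illustrative notation and are not defined in the abstract:

% Illustrative form of an optimistic loss estimator weighted by an upper occupancy bound.
% Notation (u_t, \gamma, \mathbb{1}_t) is an assumption for this sketch, not taken from the paper.
\[
  \hat{\ell}_t(x,a) \;=\; \frac{\ell_t(x,a)\,\mathbb{1}_t\{x,a\}}{u_t(x,a) + \gamma},
\]
where $\mathbb{1}_t\{x,a\}$ indicates that the pair $(x,a)$ was visited in episode $t$, $u_t(x,a)$ is an upper bound on the probability of visiting $(x,a)$ under the current policy over all transition functions in a confidence set (the "upper occupancy bound"), and $\gamma > 0$ is a small bias term. Dividing by an upper bound (rather than the unknown true visitation probability) makes the estimator optimistic, which is what enables high-probability guarantees despite the unknown transition function.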
