Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition

We consider the problem of learning in episodic finite-horizon Markov decision processes with unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves Õ(L|X|²√(|A|T)) regret with high probability, where L is the horizon, |X| is the number of states, |A| is the number of actions, and T is the number of episodes. To the best of our knowledge, ours is the first algorithm to ensure Õ(√T) regret in this challenging setting. Our key technical contribution is to introduce an optimistic loss estimator that is inversely weighted by an upper occupancy bound.
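To make the last point concrete, here is a minimal sketch of what such an optimistic, inverse-upper-occupancy-weighted loss estimator typically looks like; the symbols u_t, γ, and the indicator below are our own illustrative notation and are not defined in the abstract:

% Illustrative form of an optimistic loss estimator weighted by an upper occupancy bound.
% Notation (u_t, \gamma, \mathbb{1}_t) is an assumption for this sketch, not taken from the paper.
\[
  \hat{\ell}_t(x,a) \;=\; \frac{\ell_t(x,a)\,\mathbb{1}_t\{x,a\}}{u_t(x,a) + \gamma},
\]
where $\mathbb{1}_t\{x,a\}$ indicates that the pair $(x,a)$ was visited in episode $t$, $u_t(x,a)$ is an upper bound on the probability of visiting $(x,a)$ under the current policy over all transition functions in a confidence set (the "upper occupancy bound"), and $\gamma > 0$ is a small bias term. Dividing by an upper bound (rather than the unknown true visitation probability) makes the estimator optimistic, which is what enables high-probability guarantees despite the unknown transition function.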
