Stochastic Shortest Path with Adversarially Changing Costs

Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state with minimum expected total cost. In this paper we consider adversarial SSPs, in which the costs change adversarially over time while the dynamics (i.e., the transition function) remain unchanged. Formally, an agent interacts with an SSP environment for $K$ episodes; the cost function changes arbitrarily between episodes, and the fixed dynamics are unknown to the agent. We give high-probability regret bounds of $\widetilde O(\sqrt{K})$ assuming all costs are strictly positive, and $\widetilde O(K^{3/4})$ for the general case. To the best of our knowledge, we are the first to consider this natural setting of adversarial SSP and obtain sub-linear regret for it.
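To make the setting concrete, the following is a minimal sketch of the interaction protocol and of regret against the best fixed policy in hindsight. It is not the algorithm from the paper. The two-state environment, the transition table `P`, and the helper names `expected_cost`, `run_episode`, `best_in_hindsight`, and `regret` are all hypothetical. The sketch also assumes the common formulation in which regret is the learner's total incurred cost minus the smallest total expected cost of any single stationary policy; in this toy example every policy reaches the goal with probability 1, so all of them are valid comparators.

```python
import itertools
import random

GOAL = "g"
STATES = [0, 1]    # non-goal states
ACTIONS = [0, 1]

# Fixed dynamics (hypothetical toy example): P[s][a] maps next state -> probability.
# The learner would not know P; it is used here only to simulate and evaluate.
P = {
    0: {0: {1: 0.9, GOAL: 0.1}, 1: {0: 0.5, GOAL: 0.5}},
    1: {0: {GOAL: 1.0}, 1: {0: 0.2, GOAL: 0.8}},
}


def expected_cost(policy, cost, start=0, iters=2000):
    """Expected total cost of a stationary policy, by fixed-point iteration.

    policy[s] is the action taken in state s; cost[s][a] is the cost of (s, a).
    The iteration converges because every policy here reaches the goal.
    """
    J = {s: 0.0 for s in STATES}
    J[GOAL] = 0.0
    for _ in range(iters):
        new_J = {GOAL: 0.0}
        for s in STATES:
            a = policy[s]
            new_J[s] = cost[s][a] + sum(p * J[t] for t, p in P[s][a].items())
        J = new_J
    return J[start]


def run_episode(policy, cost, rng, start=0):
    """Simulate one episode until the goal is reached; return the incurred cost."""
    s, total = start, 0.0
    while s != GOAL:
        a = policy[s]
        total += cost[s][a]
        next_states, probs = zip(*P[s][a].items())
        s = rng.choices(next_states, weights=probs)[0]
    return total


def best_in_hindsight(costs):
    """Enumerate deterministic stationary policies; return the cheapest and its total."""
    policies = itertools.product(ACTIONS, repeat=len(STATES))
    totals = {pi: sum(expected_cost(pi, c) for c in costs) for pi in policies}
    best = min(totals, key=totals.get)
    return best, totals[best]


def regret(played, costs, rng):
    """Incurred cost of the played policies minus the best fixed policy's expected cost."""
    incurred = sum(run_episode(pi, c, rng) for pi, c in zip(played, costs))
    _, best_total = best_in_hindsight(costs)
    return incurred - best_total
```

For example, with unit costs `[[1.0, 1.0], [1.0, 1.0]]` in every episode, the expected cost equals the expected number of steps to the goal, and `best_in_hindsight` selects the policy that reaches the goal fastest in expectation. An adversary would instead pass a different cost table for each of the $K$ episodes.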
