Reinforcement Learning with Feedback Graphs

We study episodic reinforcement learning in Markov decision processes when the agent receives additional feedback per step in the form of several transition observations. Such additional observations are available in a range of tasks through extended sensors or prior knowledge about the environment (e.g., when certain actions yield similar outcomes). We formalize this setting using a feedback graph over state-action pairs and show that model-based algorithms can leverage the additional feedback for more sample-efficient learning. We give a regret bound that, ignoring logarithmic factors and lower-order terms, depends only on the size of the maximum acyclic subgraph of the feedback graph, in contrast to the polynomial dependence on the number of states and actions in the absence of a feedback graph. Finally, we highlight challenges when leveraging a small dominating set of the feedback graph, as compared to the bandit setting, and propose a new algorithm that can use knowledge of such a dominating set to learn a near-optimal policy more sample-efficiently.
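To make the feedback-graph mechanism concrete, here is a minimal illustrative sketch (not the algorithm from the paper) of how a model-based learner could fold side observations into its empirical model: whenever a state-action pair is taken, the environment is assumed to also return transition observations for every neighbor of that pair in the feedback graph, and the counts of all observed pairs are updated at once. The FeedbackGraphModel class, its neighbors mapping, and the observations interface are hypothetical names introduced here purely for illustration.

from collections import defaultdict

class FeedbackGraphModel:
    """Empirical model that updates counts for all side-observed pairs."""

    def __init__(self, neighbors):
        # neighbors[(s, a)] -> iterable of state-action pairs whose transitions
        # are also observed when (s, a) is taken (assumed to include (s, a)).
        self.neighbors = neighbors
        self.visit_counts = defaultdict(int)    # n(s, a)
        self.trans_counts = defaultdict(int)    # n(s, a, s_next)
        self.reward_sums = defaultdict(float)   # cumulative observed reward

    def update(self, observations):
        # observations: {(s, a): (s_next, reward)} for the taken pair and all
        # of its feedback-graph neighbors, as returned by a (hypothetical)
        # environment that exposes side observations.
        for (s, a), (s_next, r) in observations.items():
            self.visit_counts[(s, a)] += 1
            self.trans_counts[(s, a, s_next)] += 1
            self.reward_sums[(s, a)] += r

    def estimated_transition(self, s, a, s_next):
        # Empirical transition probability; zero if the pair is unobserved.
        n = self.visit_counts[(s, a)]
        return self.trans_counts[(s, a, s_next)] / n if n else 0.0

The point of the sketch is that a single environment step can increase the counts of many state-action pairs, which is what allows regret to scale with graph quantities such as the maximum acyclic subgraph rather than with the total number of states and actions.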
