Average-Reward Learning and Planning with Options

We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs. Our contributions include general, convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, and sample-based planning variants of our learning algorithms. Our algorithms and convergence proofs extend those recently developed by Wan, Naik, and Sutton. We also extend the notion of option-interrupting behavior from the discounted to the average-reward formulation. We demonstrate the efficacy of the proposed algorithms with experiments on a continuing version of the Four-Room domain.
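
To make the flavor of such updates concrete, below is a minimal, hypothetical sketch of an inter-option (SMDP-level) value update in the average-reward setting. It assumes the update mirrors the differential Q-learning of Wan, Naik, and Sutton, with the reward-rate estimate subtracted once per step of the option's duration; the function and variable names are illustrative and are not taken from the paper.

```python
# A minimal sketch (not the paper's exact algorithm): one inter-option,
# differential Q-learning-style update for the average-reward setting,
# assuming it follows Wan, Naik, and Sutton's differential Q-learning
# applied at the SMDP level. All names below are hypothetical.
from collections import defaultdict

def inter_option_update(q, avg_reward, options, s, o, cum_reward, duration, s_next,
                        alpha=0.1, eta=0.1):
    """Apply one update after option `o`, started in state `s`, ended in
    `s_next` after `duration` steps with undiscounted cumulative reward
    `cum_reward`. Returns the updated reward-rate estimate."""
    # Differential TD error: no discounting; the reward-rate estimate is
    # charged once per elapsed step of the option before bootstrapping.
    delta = (cum_reward - avg_reward * duration
             + max(q[(s_next, o2)] for o2 in options) - q[(s, o)])
    q[(s, o)] += alpha * delta            # option-value update
    avg_reward += eta * alpha * delta     # reward-rate estimate shares the same error
    return avg_reward

# Example usage with a tabular option-value table:
q = defaultdict(float)
options = ["hallway-N", "hallway-E"]
avg_reward = 0.0
avg_reward = inter_option_update(q, avg_reward, options, s=(1, 1), o="hallway-N",
                                 cum_reward=3.0, duration=5, s_next=(3, 6))
```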

[1] Fang Cao et al. RVI reinforcement learning for semi-Markov decision processes with average reward, 2010, 8th World Congress on Intelligent Control and Automation.

[2] Junhyuk Oh et al. Discovery of Options via Meta-Learned Subgoals, 2021, NeurIPS.

[3] Shie Mannor et al. Q-Cut - Dynamic Discovery of Sub-goals in Reinforcement Learning, 2002, ECML.

[4] Marlos C. Machado et al. A Laplacian Framework for Option Discovery in Reinforcement Learning, 2017, ICML.

[5] Doina Precup et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999, Artificial Intelligence.

[6] Pieter Abbeel et al. Variational Option Discovery Algorithms, 2018, arXiv.

[7] Nuttapong Chentanez et al. Intrinsically Motivated Reinforcement Learning, 2004, NIPS.

[8] Vivek S. Borkar et al. Learning Algorithms for Markov Decision Processes with Average Cost, 2001, SIAM Journal on Control and Optimization.

[9] V. Borkar. Asynchronous Stochastic Approximations, 1998.

[10] S. Mahadevan et al. Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning, 1999.

[11] P. Schweitzer. Iterative solution of the functional equations of undiscounted Markov renewal programming, 1971.

[12] Paul J. Schweitzer et al. The Functional Equations of Undiscounted Markov Renewal Programming, 1971, Mathematics of Operations Research.

[13] Alessandro Lazaric et al. Exploration-Exploitation in MDPs with Options, 2016.

[14] Sergey Levine et al. Diversity is All You Need: Learning Skills without a Reward Function, 2018, ICLR.

[15] Shalabh Bhatnagar et al. Universal Option Models, 2014, NIPS.

[16] Shimon Whiteson et al. Average-Reward Off-Policy Policy Evaluation with Function Approximation, 2021, ICML.

[17] Andrew G. Barto et al. Using relative novelty to identify useful temporal abstractions in reinforcement learning, 2004, ICML.

[18] Richard S. Sutton et al. Learning and Planning in Average-Reward Markov Decision Processes, 2020, ICML.

[19] Lihong Li et al. PAC-inspired Option Discovery in Lifelong Reinforcement Learning, 2014, ICML.

[20] Doina Precup et al. The Option-Critic Architecture, 2016, AAAI.

[21] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998, IEEE Transactions on Neural Networks.

[22] Daniel Polani et al. Grounding subgoals in information transitions, 2011, IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL).

[23] Martin L. Puterman et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[24] V. Borkar et al. An analog scheme for fixed point computation. I. Theory, 1997.

[25] John N. Tsitsiklis et al. Asynchronous Stochastic Approximation and Q-Learning, 1994, Machine Learning.

[26] Abhijit Gosavi et al. Reinforcement learning for long-run average cost, 2004, European Journal of Operational Research.

[27] Vivek S. Borkar et al. An analog scheme for fixed-point computation, Part II: Applications, 1999.

[28] TaeChoong Chung et al. Policy Gradient Semi-Markov Decision Process, 2008, 20th IEEE International Conference on Tools with Artificial Intelligence.

[29] Satinder P. Singh et al. Linear options, 2010, AAMAS.

[30] Daan Wierstra et al. Variational Intrinsic Control, 2016, ICLR.

[31] V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, 2008, Texts and Readings in Mathematics.

[32] Andrew G. Barto et al. Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density, 2001, ICML.