Average-Reward Off-Policy Policy Evaluation with Function Approximation

We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, together with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as in challenging robot simulation tasks.
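As background for readers less familiar with the average-reward formulation, the two quantities named above have standard definitions (following, e.g., Puterman [10] and Sutton & Barto [42]); the equations below are this standard background under the usual ergodicity assumptions, not the paper's algorithms. For a target policy $\pi$, the reward rate and the differential value function are

\begin{align*}
  \bar r_\pi &\doteq \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\!\left[ R_t \mid A_{0:t-1} \sim \pi \right], \\
  v_\pi(s) &\doteq \mathbb{E}\!\left[ \sum_{t=1}^{\infty} \bigl( R_t - \bar r_\pi \bigr) \,\middle|\, S_0 = s,\ A_{0:\infty} \sim \pi \right],
\end{align*}

and $v_\pi$ satisfies the differential Bellman equation $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r - \bar r_\pi + v_\pi(s') \bigr]$. Off-policy evaluation seeks these quantities from data generated by a different behavior policy, with $v_\pi$ approximated, e.g., linearly as $\hat v(s) = \mathbf{w}^\top \mathbf{x}(s)$ for a feature vector $\mathbf{x}(s)$.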

[1] Richard S. Sutton, et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, 1990, ML.

[2] Shimon Whiteson, et al. GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values, 2020, ICML.

[3] Ronald A. Howard, et al. Dynamic Programming and Markov Processes, 1960.

[4] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[5] Philip Bachman, et al. Deep Reinforcement Learning that Matters, 2017, AAAI.

[6] Donald E. Kirk, et al. Optimal control theory: an introduction, 1970.

[7] Ali H. Sayed, et al. Distributed Policy Evaluation Under Multiple Behavior Strategies, 2013, IEEE Transactions on Automatic Control.

[8] Richard S. Sutton, et al. A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation, 2008, NIPS.

[9] Sergey Levine, et al. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models, 2018, NeurIPS.

[10] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[11] Le Song, et al. SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation, 2017, ICML.

[12] Shimon Whiteson, et al. Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation, 2019, ICML.

[13] Herke van Hoof, et al. Addressing Function Approximation Error in Actor-Critic Methods, 2018, ICML.

[14] Yao Liu, et al. Combining Parametric and Nonparametric Models for Off-Policy Evaluation, 2019, ICML.

[15] Lihong Li, et al. Stochastic Variance Reduction Methods for Policy Evaluation, 2017, ICML.

[16] Gabriel Dulac-Arnold, et al. Challenges of Real-World Reinforcement Learning, 2019, ArXiv.

[17] Qiang Liu, et al. Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning, 2020, ICLR.

[18] Ilya Kostrikov, et al. AlgaeDICE: Policy Gradient from Arbitrary Experience, 2019, ArXiv.

[19] Thorsten Joachims, et al. MOReL: Model-Based Offline Reinforcement Learning, 2020, NeurIPS.

[20] Yifei Ma, et al. Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling, 2019, NeurIPS.

[21] Alborz Geramifard, et al. Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping, 2008, UAI.

[22] Yao Liu, et al. Representation Balancing MDPs for Off-Policy Policy Evaluation, 2018, NeurIPS.

[23] Martha White, et al. Planning with Expectation Models, 2019, IJCAI.

[24] Panos M. Pardalos, et al. Approximate dynamic programming: solving the curses of dimensionality, 2009, Optim. Methods Softw.

[25] Lantao Yu, et al. MOPO: Model-based Offline Policy Optimization, 2020, NeurIPS.

[26] Shie Mannor, et al. Consistent On-Line Off-Policy Evaluation, 2017, ICML.

[27] J. Zico Kolter, et al. The Fixed Points of Off-Policy TD, 2011, NIPS.

[28] Richard S. Sutton, et al. Learning and Planning in Average-Reward Markov Decision Processes, 2020, ArXiv.

[29] Marek Petrik, et al. Finite-Sample Analysis of Proximal Gradient TD Algorithms, 2015, UAI.

[30] Philip S. Thomas, et al. High-Confidence Off-Policy Evaluation, 2015, AAAI.

[31] Qiang Liu, et al. Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation, 2019, ICLR.

[32] Shimon Whiteson, et al. Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning, 2020, AAAI.

[33] Emma Brunskill, et al. Off-Policy Policy Gradient with State Distribution Correction, 2019, UAI.

[34] Huizhen Yu, et al. On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning, 2017, ArXiv.

[35] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[36] Justin A. Boyan, et al. Least-Squares Temporal Difference Learning, 1999, ICML.

[37] Masatoshi Uehara, et al. Minimax Weight and Q-Function Learning for Off-Policy Evaluation, 2019, ICML.

[38] Geoffrey E. Hinton, et al. Rectified Linear Units Improve Restricted Boltzmann Machines, 2010, ICML.

[39] Lihong Li. A perspective on off-policy evaluation in reinforcement learning, 2019, Frontiers of Computer Science.

[40] Bo Liu, et al. Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces, 2014, ArXiv.

[41] Qiang Liu, et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation, 2018, NeurIPS.

[42] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[43] Bo Liu, et al. A Block Coordinate Ascent Algorithm for Mean-Variance Optimization, 2018, NeurIPS.

[44] Bo Dai, et al. GenDICE: Generalized Offline Estimation of Stationary Values, 2020, ICLR.

[45] Martha White, et al. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning, 2015, J. Mach. Learn. Res.

[46] Peter L. Bartlett, et al. POLITEX: Regret Bounds for Policy Iteration using Expert Prediction, 2019, ICML.

[47] John N. Tsitsiklis, et al. Actor-Critic Algorithms, 1999, NIPS.

[48] Leemon C. Baird, et al. Residual Algorithms: Reinforcement Learning with Function Approximation, 1995, ICML.

[49] Dilan Görür, et al. A maximum-entropy approach to off-policy evaluation in average-reward MDPs, 2020, NeurIPS.

[50] Dimitri P. Bertsekas, et al. Convergence Results for Some Temporal Difference Methods Based on Least Squares, 2009, IEEE Transactions on Automatic Control.

[51] Bo Dai, et al. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections, 2019, NeurIPS.

[52] Marc G. Bellemare, et al. Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift, 2019, AAAI.

[53] Shalabh Bhatnagar, et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation, 2009, ICML.

[54] V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, 2008, Texts and Readings in Mathematics.

[55] Nan Jiang, et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, 2015, ICML.

[56] Shalabh Bhatnagar, et al. A Convergent Off-Policy Temporal Difference Algorithm, 2019, ECAI.

[57] Peter Norvig. Artificial Intelligence: A Modern Approach, 1995.