Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient

Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency, as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates. The price of this inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited and a very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation that can be solved in closed form. The solution is differentiable w.r.t. the policy parameters and provides an estimate of the policy gradient. In this way, we avoid the high variance of importance sampling approaches and the high bias of semi-gradient methods. We empirically analyze the quality of our gradient estimate against state-of-the-art methods and show that it outperforms the baselines in terms of sample efficiency on classical control tasks.
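To illustrate the idea, the following is a minimal sketch (not the authors' implementation) of how a kernel-based Bellman equation can be solved in closed form on a batch of transitions and differentiated w.r.t. the policy parameters with automatic differentiation. The toy 1-D task, the Gaussian kernel, the bandwidth, and the linear policy are illustrative assumptions, not the paper's benchmarks or estimator details.

```python
import torch

torch.manual_seed(0)

# Batch of off-policy transitions (s, a, r, s') from a toy 1-D task (assumed for illustration).
N = 200
S  = torch.rand(N, 1) * 2 - 1                        # states in [-1, 1]
A  = torch.rand(N, 1) * 2 - 1                        # actions in [-1, 1]
R  = -(S + A).pow(2).squeeze(-1)                     # reward: keep s + a near 0
S2 = (S + 0.1 * A) + 0.01 * torch.randn(N, 1)        # next states
gamma, h = 0.95, 0.2                                 # discount factor, kernel bandwidth

def rbf(X, Y, bandwidth):
    """Gaussian kernel matrix between two sets of points."""
    d2 = (X.unsqueeze(1) - Y.unsqueeze(0)).pow(2).sum(-1)
    return torch.exp(-0.5 * d2 / bandwidth ** 2)

# Deterministic linear policy a = theta * s; theta is the parameter we differentiate.
theta = torch.tensor([0.5], requires_grad=True)

def policy_gradient(theta):
    pi_S2 = theta * S2                               # policy actions at the sampled next states
    # Kernel weights: how much sample j "explains" the query point (s'_i, pi(s'_i)).
    K = rbf(S2, S, h) * rbf(pi_S2, A, h)             # N x N
    P = K / K.sum(dim=1, keepdim=True)               # row-normalized transition operator
    # Closed-form solution of the nonparametric Bellman equation V = R + gamma * P V.
    V = torch.linalg.solve(torch.eye(N) - gamma * P, R.unsqueeze(-1)).squeeze(-1)
    J = V.mean()                                     # surrogate return over the batch
    return torch.autograd.grad(J, theta)[0], J

grad, J = policy_gradient(theta)
print("estimated return:", J.item(), "gradient w.r.t. theta:", grad.item())
```

Because the value function is obtained from a linear solve rather than a semi-gradient update or importance-sampled rollouts, the whole computation stays differentiable end-to-end, which is the property the abstract relies on to obtain the policy gradient.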
