Reward-Weighted Regression Converges to a Global Optimum

Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, each learning iteration consists of sampling a batch of trajectories with the current policy and fitting a new policy that maximizes a return-weighted log-likelihood of the actions taken. Although RWR is known to yield monotonic policy improvement under certain conditions, whether and under which conditions it converges to the optimal policy have remained open questions. In this paper, we provide the first proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case of finite state and action spaces, we prove R-linear convergence of the state-value function to the optimum.
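
As a concrete illustration of the iterative scheme described above, here is a minimal sketch of one tabular RWR step (no function approximation) in Python. It applies the return-weighted update pi'(a|s) ∝ pi(a|s)·Q(s,a), i.e. re-weighting each action's probability by its action value and renormalizing per state. This is a sketch, not the paper's exact operator: names like `rwr_step` and `q_values` are illustrative, action values are assumed nonnegative so the weights are valid, and Q is held fixed for brevity, whereas the full algorithm would re-estimate it from trajectories sampled with the current policy at every iteration.

```python
import numpy as np

def rwr_step(policy, q_values):
    """One tabular RWR iteration: pi'(a|s) = pi(a|s) * Q(s,a) / V(s).

    policy:   (S, A) array whose rows sum to 1.
    q_values: (S, A) array of nonnegative action values under `policy`.
    """
    weighted = policy * q_values              # return-weighted likelihood weights
    v = weighted.sum(axis=1, keepdims=True)   # V(s) = sum_a pi(a|s) * Q(s,a)
    return weighted / v                       # renormalize each state's row

# Toy usage: 2 states, 2 actions, fixed (hypothetical) Q-values.
policy = np.full((2, 2), 0.5)                 # uniform initial policy
q = np.array([[1.0, 2.0],
              [3.0, 1.0]])
for _ in range(50):
    policy = rwr_step(policy, q)
print(policy.round(3))  # probability mass concentrates on argmax_a Q(s,a)
```

With fixed Q-values the iteration converges geometrically to the greedy policy in each state, which mirrors the kind of R-linear convergence of values established in the paper for the finite case.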
