Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL), in which a surrogate problem that restricts consecutive policies to be 'close' to one another is solved iteratively. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish a $\tilde O(1/\sqrt{N})$ convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs, for which we prove fast rates of $\tilde O(1/N)$, much like results in convex optimization. This is the first result in RL showing that regularizing the instantaneous cost or reward yields faster convergence rates.
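For concreteness, the adaptive update described above can be viewed as a mirror-descent step over the simplex of action distributions at every state. The display below is a minimal sketch of such an update for a cost-minimization MDP; the symbols $Q^{\pi_k}$ (the state-action value of the current policy $\pi_k$), $t_k$ (a step size), $\Delta_{\mathcal{A}}$ (the simplex over actions), and $d_{\mathrm{KL}}$ (the KL divergence over action distributions) are notational assumptions introduced here for illustration, not taken from the abstract:

$$\pi_{k+1}(\cdot \mid s) \in \arg\min_{\pi \in \Delta_{\mathcal{A}}} \Big\{ t_k \big\langle Q^{\pi_k}(s,\cdot),\, \pi \big\rangle + d_{\mathrm{KL}}\big(\pi \,\big\|\, \pi_k(\cdot \mid s)\big) \Big\}, \qquad \forall s \in \mathcal{S}.$$

Under this reading, the unregularized analysis corresponds to diminishing step sizes on the order of $1/\sqrt{k}$, matching the $\tilde O(1/\sqrt{N})$ rate, while in regularized MDPs the strong convexity contributed by the regularizer permits step sizes on the order of $1/k$ and the faster $\tilde O(1/N)$ rate, mirroring the classical picture from convex optimization.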
