Beyond the One-Step Greedy Approach in Reinforcement Learning

The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation. Implementations of this algorithm with several variants of the latter, evaluation, stage, e.g., $n$-step and trace-based returns, have been analyzed in previous works. However, the case of multiple-step lookahead policy improvement, despite the recent increase in empirical evidence of its strength, has, to our knowledge, not been carefully analyzed yet. In this work, we present the first such analysis. Namely, we formulate variants of multiple-step policy improvement, derive new algorithms based on these definitions, and prove their convergence. Moreover, we show that recent prominent Reinforcement Learning algorithms are, in fact, instances of our framework. We thus shed light on their empirical success and provide a recipe for deriving new algorithms for future study.
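To make the distinction concrete, below is a minimal, illustrative sketch of one-step versus $h$-step greedy policy improvement on a tabular MDP. The array names and shapes (P, R, v) are assumptions made for illustration and are not taken from the paper; the $h$-step case simply acts greedily with respect to an $h$-horizon problem whose terminal value is the current estimate $v$.

```python
import numpy as np

def one_step_greedy(P, R, gamma, v):
    """Standard one-step greedy improvement:
    pi'(s) in argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v(s') ].

    P: (S, A, S) transition kernel, R: (S, A) rewards, v: (S,) value estimate.
    (Names and shapes are illustrative, not from the paper.)
    """
    q = R + gamma * (P @ v)        # (S, A) one-step lookahead action values
    return q.argmax(axis=1)

def h_step_greedy(P, R, gamma, v, h):
    """h-step greedy improvement (sketch): take the first action of an optimal
    policy for the h-horizon problem that terminates with value v."""
    v_k = v.copy()
    for _ in range(h - 1):         # backward induction over the last h-1 steps
        v_k = (R + gamma * (P @ v_k)).max(axis=1)
    q_h = R + gamma * (P @ v_k)    # first-step action values of the h-horizon problem
    return q_h.argmax(axis=1)      # h = 1 recovers the one-step greedy policy
```

In a tabular policy-iteration loop, swapping one_step_greedy for h_step_greedy with h > 1 in the improvement step gives the kind of multiple-step greedy update discussed above; the sketch is only meant to fix intuition, not to reproduce the paper's exact formulation.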
