How to Combine Tree-Search Methods in Reinforcement Learning

Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success. Usually, these lookahead policies are implemented with specific planning methods such as Monte Carlo Tree Search (e.g., in AlphaZero). Referring to the planning problem as tree search, a reasonable practice in these implementations is to back up the value only at the leaves, while the information obtained at the root is not leveraged other than for updating the policy. Here, we question the potency of this approach. Namely, the latter procedure is non-contractive in general, so its convergence is not guaranteed. Our proposed enhancement is straightforward: use the return from the optimal tree path to back up the values at the descendants of the root. This leads to a $\gamma^h$-contracting procedure, where $\gamma$ is the discount factor and $h$ is the tree depth. To establish our results, we first introduce a notion called \emph{multiple-step greedy consistency}. We then provide convergence rates for two algorithmic instantiations of the above enhancement in the presence of noise injected into both the tree-search and value-estimation stages.

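To make the contraction claim concrete, here is a minimal sketch, assuming a small hypothetical tabular MDP with a known model, of the $h$-step optimal backup: replacing a value estimate with the optimal $h$-step return from an exhaustive depth-$h$ tree search applies the $h$-step Bellman optimality operator, which is $\gamma^h$-contracting around the optimal values. The toy MDP, the function name `optimal_h_step_return`, and the exhaustive search are illustrative assumptions; the paper's backup at the descendants of the root and its noise model are not reproduced here.

```python
# Minimal sketch of an h-step optimal backup on an illustrative toy MDP (not the paper's code).
import numpy as np

gamma, h = 0.9, 3                      # discount factor and tree depth
n_states, n_actions = 5, 2             # hypothetical small tabular MDP
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a]: next-state distribution
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # expected immediate rewards
V = np.zeros(n_states)                                            # current value estimates

def optimal_h_step_return(s, depth):
    """Exhaustive depth-`depth` tree search from state s: the optimal discounted
    return, bootstrapping with V at the leaves, i.e. (T^depth V)(s)."""
    if depth == 0:
        return V[s]
    next_vals = np.array([optimal_h_step_return(s2, depth - 1) for s2 in range(n_states)])
    return max(R[s, a] + gamma * P[s, a] @ next_vals for a in range(n_actions))

# Back up the optimal h-step return at every searched state: V <- T^h V.
V_new = np.array([optimal_h_step_return(s, h) for s in range(n_states)])

# Sanity check of the gamma^h contraction against V*, computed here by value iteration.
V_star = np.zeros(n_states)
for _ in range(2000):
    V_star = np.array([max(R[s, a] + gamma * P[s, a] @ V_star for a in range(n_actions))
                       for s in range(n_states)])
assert np.abs(V_new - V_star).max() <= gamma**h * np.abs(V - V_star).max() + 1e-8
```

By contrast, a procedure that backs up only the leaf values, as questioned above, is non-contractive in general, which is the gap the proposed enhancement is meant to close.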