Generalized Mean Estimation in Monte-Carlo Tree Search

We consider Monte-Carlo Tree Search (MCTS) applied to Markov Decision Processes (MDPs) and Partially Observable MDPs (POMDPs), and the well-known Upper Confidence bound for Trees (UCT) algorithm. In UCT, a tree with nodes (states) and edges (actions) is incrementally built by expanding nodes, and node values are updated through a backup strategy based on the average value of child nodes. However, it has been shown that, given enough samples, the maximum operator yields more accurate node value estimates than averaging. Instead of settling for one of these two value estimates, we go a step further and propose a novel backup strategy based on the power mean operator, which computes a value between the average and the maximum. We call our new approach Power-UCT, and we argue how the power mean operator helps to speed up learning in MCTS. We theoretically analyze our method, providing guarantees of convergence to the optimum. Finally, we empirically demonstrate its effectiveness on well-known MDP and POMDP benchmarks, showing significant improvements in performance and convergence speed over state-of-the-art algorithms.
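To make the interpolation between average and maximum concrete, the following is a minimal sketch of a power mean backup in Python; the function name power_mean and the visit-count weighting shown here are illustrative assumptions, not code from the paper. For non-negative child values X_i with weights w_i proportional to visit counts, the power mean of order p is (sum_i w_i X_i^p)^(1/p): p = 1 recovers the weighted average used by standard UCT, and the value tends to max_i X_i as p grows.

    import numpy as np

    def power_mean(values, counts, p):
        # Weighted power mean of order p: (sum_i w_i * x_i**p)**(1/p),
        # with weights w_i proportional to visit counts (a sketch;
        # assumes non-negative values). p = 1 gives the weighted
        # average; p -> infinity approaches the maximum.
        values = np.asarray(values, dtype=float)
        w = np.asarray(counts, dtype=float)
        w = w / w.sum()
        return float((w @ values**p) ** (1.0 / p))

    # p = 1 is the UCT-style average backup; larger p moves toward max.
    print(power_mean([0.2, 0.5, 0.9], counts=[10, 5, 1], p=1.0))
    print(power_mean([0.2, 0.5, 0.9], counts=[10, 5, 1], p=8.0))

Tuning p thus trades off the low variance of averaging against the reduced underestimation bias of the maximum.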
