Bootstrapping from Game Tree Search

In this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function by adjusting the heuristic towards the values computed by an alpha-beta search. Our algorithm differs from previous approaches to learning from search, such as Samuel's checkers player and the TD-Leaf algorithm, in two key ways. First, we update all nodes in the search tree, rather than a single node. Second, we use the outcome of a deep search, instead of the outcome of a subsequent search, as the training signal for the evaluation function. We implemented our algorithm in a chess program, Meep, using a linear heuristic function. After initialising its weight vector to small random values, Meep was able to learn high-quality weights from self-play alone. When tested online against human opponents, Meep played at a master level, the best performance of any chess program with a heuristic learned entirely from self-play.
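The core idea, moving a linear heuristic towards search values at every node of the tree, can be illustrated with a minimal sketch. The sketch makes several assumptions that are not taken from the paper: it uses plain depth-limited minimax in place of alpha-beta, a toy tree of dictionaries in place of chess positions, and a simple squared-error gradient step. The names `minimax`, `update_from_search`, `demo_tree`, and the `features`/`outcome` keys are hypothetical.

```python
import numpy as np

def minimax(node, depth, maximizing, weights, visited):
    """Depth-limited minimax that records the backed-up value of every node."""
    children = node.get("children", [])
    if "outcome" in node:
        value = node["outcome"]                        # terminal: true game result
    elif depth == 0 or not children:
        value = float(weights @ node["features"])      # frontier: heuristic value
    else:
        child_values = [
            minimax(child, depth - 1, not maximizing, weights, visited)
            for child in children
        ]
        value = max(child_values) if maximizing else min(child_values)
    visited.append((node, value))
    return value

def update_from_search(root, weights, depth, learning_rate=0.1):
    """One gradient step moving the heuristic towards the search value at
    every visited node (assumed squared-error loss, linear heuristic)."""
    visited = []
    minimax(root, depth, True, weights, visited)
    step = np.zeros_like(weights)
    for node, search_value in visited:
        error = search_value - float(weights @ node["features"])
        step += error * node["features"]
    return weights + learning_rate * step

# Toy tree (illustrative only): a root with two terminal children.
demo_tree = {
    "features": np.array([1.0, 0.0]),
    "children": [
        {"features": np.array([0.0, 1.0]), "outcome": 1.0},
        {"features": np.array([0.0, -1.0]), "outcome": -1.0},
    ],
}
new_weights = update_from_search(demo_tree, np.zeros(2), depth=1)
```

Because every visited node contributes an error term, a single search yields many training targets, in contrast to methods that update only the root or a single leaf.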
