论文信息 - Bootstrapping from Game Tree Search

Bootstrapping from Game Tree Search

In this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function, by updating the heuristic towards the values computed by an alpha-beta search. Our algorithm differs from previous approaches to learning from search, such as Samuel's checkers player and the TD-Leaf algorithm, in two key ways. First, we update all nodes in the search tree, rather than a single node. Second, we use the outcome of a deep search, instead of the outcome of a subsequent search, as the training signal for the evaluation function. We implemented our algorithm in a chess program Meep, using a linear heuristic function. After initialising its weight vector to small random values, Meep was able to learn high quality weights from self-play alone. When tested online against human opponents, Meep played at a master level, the best performance of any chess program with a heuristic learned entirely from self-play.

[1] Arthur L. Samuel,et al. Some Studies in Machine Learning Using the Game of Checkers , 1967, IBM J. Res. Dev..

[2] Jonathan Schaeffer,et al. The History Heuristic and Alpha-Beta Search Enhancements in Practice , 1989, IEEE Trans. Pattern Anal. Mach. Intell..

[3] Donald F. Beal,et al. A Generalised Quiescence Search Algorithm , 1990, Artif. Intell..

[4] Gerald Tesauro,et al. TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play , 1994, Neural Computation.

[5] Donald F. Beal,et al. Learning Piece Values Using Temporal Differences , 1997, J. Int. Comput. Games Assoc..

[6] Michael Buro,et al. From Simple Features to Sophisticated Evaluation Functions , 1998, Computers and Games.

[7] Andrew Tridgell,et al. KnightCap: A Chess Programm That Learns by Combining TD(lambda) with Game-Tree Search , 1998, ICML.

[8] Jonathan Schaeffer,et al. Temporal Difference Learning Applied to a High-Performance Game-Playing Program , 2001, IJCAI.

[9] Murray Campbell,et al. Deep Blue , 2002, Artif. Intell..

[10] Jonathan Schaeffer,et al. Rediscovering *-Minimax Search , 2004, Computers and Games.

[11] Richard S. Sutton,et al. Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[12] David Silver,et al. Combining online and offline knowledge in UCT , 2007, ICML '07.

[13] Joel Veness,et al. Effective Use of Transposition Tables in Stochastic Game Tree Search , 2007, 2007 IEEE Symposium on Computational Intelligence and Games.

[14] David Silver,et al. Combining Online and Offline Learning in UCT , 2007 .