Learning to evaluate Go positions via temporal difference methods

The game of Go has a high branching factor that defeats the tree search approach used in computer chess, and long-range spatiotemporal interactions that make position evaluation extremely difficult. Development of conventional Go programs is hampered by their knowledge-intensive nature. We demonstrate a viable alternative by training neural networks to evaluate Go positions via temporal difference (TD) learning. Our approach is based on neural network architectures that reflect the spatial organization of both input and reinforcement signals on the Go board, and on training protocols that provide exposure to competent (though unlabelled) play. These techniques yield far better performance than undifferentiated networks trained by self-play alone. A network with fewer than 500 weights learned, within 3000 games of 9x9 Go, a position evaluation function superior to that of a commercial Go program.
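As a rough illustration of the TD idea mentioned above, the following minimal sketch trains a position evaluator so that each position's predicted value moves toward the value of the position that follows it, with the final position trained toward the game's outcome. It is a generic TD(0) update on a single sigmoid unit, not the spatially organized architecture or the training protocol described in the abstract. The feature-vector encoding of positions and the names `evaluate`, `td0_update`, and `train_on_game` are assumptions made for illustration only.

```python
import math


def evaluate(weights, features):
    """Predicted outcome in (0, 1): sigmoid of a weighted feature sum."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def td0_update(weights, features, target, lr=0.1):
    """Take one gradient step that moves the prediction toward `target`."""
    v = evaluate(weights, features)
    delta = target - v          # TD error for this position
    slope = v * (1.0 - v)       # derivative of the sigmoid at this prediction
    return [w + lr * delta * slope * x for w, x in zip(weights, features)]


def train_on_game(weights, positions, outcome, lr=0.1):
    """One pass over a game (hypothetical encoding: one feature list per position).

    Each position is trained toward the current value of its successor;
    the last position is trained toward the final outcome (e.g. 1.0 = win).
    """
    for t, feats in enumerate(positions):
        if t + 1 < len(positions):
            target = evaluate(weights, positions[t + 1])
        else:
            target = outcome
        weights = td0_update(weights, feats, target, lr)
    return weights


if __name__ == "__main__":
    # Toy two-position "game" that ends in a win; features are made up.
    w = [0.0, 0.0]
    game = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(200):
        w = train_on_game(w, game, outcome=1.0)
    print(evaluate(w, game[0]), evaluate(w, game[1]))
```

In this toy run the outcome signal reaches the first position only through the second position's estimate, which is the bootstrapping behaviour that distinguishes TD learning from training every position directly on the final result.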
