Multi-player residual advantage learning with general function approximation

A new algorithm, advantage learning, is presented that improves on advantage updating by requiring that a single function be learned rather than two. Furthermore, advantage learning requires only a single type of update, the learning update, whereas advantage updating requires two different types of updates, a learning update and a normalization update. The reinforcement learning system uses the residual form of advantage learning.

An application of reinforcement learning to a Markov game is presented. The test-bed has continuous states and nonlinear dynamics. The game consists of two players, a missile and a plane; the missile pursues the plane and the plane evades the missile. On each time step, each player chooses one of two possible actions: turn left or turn right, resulting in an instantaneous 90-degree change in the aircraft's heading. Reinforcement is given only when the missile hits the plane or the plane reaches an escape distance from the missile. The advantage function is stored in a single-hidden-layer sigmoidal network. The advantage learning algorithm for optimal control is modified for games in order to find the minimax point rather than the maximum.

Speed of learning is increased by a new algorithm, Incremental Delta-Delta (IDD), which extends Jacobs' (1988) Delta-Delta for use in incremental training; it differs from Sutton's (1992) Incremental Delta-Bar-Delta in that it does not require a trace and is suitable for use with general function approximation systems. Empirical results gathered using the missile/aircraft test-bed validate the theory that residual forms of reinforcement learning algorithms converge to a local minimum of the mean squared Bellman residual when using general function approximation systems. Also, to our knowledge, this is the first time an approximate second-order method has been used with residual algorithms. Empirical results are presented comparing convergence rates with and without IDD for the reinforcement learning test-bed described above and for a supervised learning test-bed. These experiments demonstrate that IDD increased the rate of convergence and yielded total asymptotic error an order of magnitude lower than backpropagation alone.
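For readers unfamiliar with advantage learning, the fixed-point equation it solves can be stated as follows. This is a sketch in assumed notation (A for the advantage function, R for reinforcement, gamma the discount factor, Delta-t the time-step duration, K a time-scaling constant); the paper's exact symbols may differ:

    A^*(x,u) \;=\; \max_{u'} A^*(x,u')
        \;+\; \frac{R + \gamma^{\Delta t}\,\max_{u'} A^*(x',u') - \max_{u'} A^*(x,u')}{K\,\Delta t}

Because the state value enters only as max_{u'} A(x,u'), a single function suffices, and the 1/(K Delta-t) factor keeps the differences between advantages of actions in the same state from vanishing as the time step shrinks, removing the need for a separate normalization update. In the game setting described above, each max over a single player's actions is replaced by the minimax value over the joint actions of the missile (maximizer) and the plane (minimizer).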
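The following Python sketch illustrates one way the residual form of the game update could look, with a fixed linear feature map standing in for the paper's sigmoidal network. All names (features, residual_update, the 2x2 joint-action layout) and the parameter defaults are illustrative assumptions, not the paper's code:

    import numpy as np

    def features(x, u_missile, u_plane):
        # Hypothetical feature map: squashed state features crossed with a
        # one-hot over the 2x2 joint-action space. A stand-in for the paper's
        # single-hidden-layer sigmoidal network.
        onehot = np.zeros(4)
        onehot[2 * u_missile + u_plane] = 1.0
        return np.outer(np.tanh(x), onehot).ravel()

    def minimax_value_and_actions(w, x):
        # Missile (rows) maximizes the advantage, plane (columns) minimizes.
        # For simplicity this takes the pure-strategy minimax of the 2x2 game;
        # it coincides with the game value whenever a saddle point exists.
        A = np.array([[w @ features(x, um, up) for up in (0, 1)]
                      for um in (0, 1)])
        up_star = A.max(axis=0).argmin()   # plane: minimize missile's best reply
        um_star = A[:, up_star].argmax()   # missile: best reply to that column
        return A[um_star, up_star], um_star, up_star

    def residual_update(w, x, um, up, r, x_next,
                        gamma=0.9, dt=0.05, K=10.0, alpha=0.1, phi=0.5):
        # Advantage-learning target: V(x) + (R + gamma^dt V(x') - V(x))/(K dt),
        # where V is the minimax of the advantage function; 1/(K dt) expands
        # the differences between advantages in the same state.
        v, um_s, up_s = minimax_value_and_actions(w, x)
        v_next, um_n, up_n = minimax_value_and_actions(w, x_next)
        inv = 1.0 / (K * dt)
        target = v + inv * (r + gamma**dt * v_next - v)
        e = target - w @ features(x, um, up)   # Bellman residual
        # Residual algorithm: blend the gradients of both sides of the
        # residual. phi = 0 recovers the direct (TD-style) update; phi = 1 is
        # pure residual gradient, which descends the mean squared residual.
        grad_target = ((1.0 - inv) * features(x, um_s, up_s)
                       + inv * gamma**dt * features(x_next, um_n, up_n))
        grad_a = features(x, um, up)
        return w - alpha * e * (phi * grad_target - grad_a)

    # Example: one update from a random transition (illustrative only).
    rng = np.random.default_rng(0)
    x, x_next = rng.standard_normal(4), rng.standard_normal(4)
    w = residual_update(np.zeros(16), x, um=0, up=1, r=0.0, x_next=x_next)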
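The abstract describes IDD only at a high level: Delta-Delta made incremental, no trace required, and usable with general function approximation systems. Under those constraints, one plausible reading is a per-weight step size adapted by the sign agreement of successive per-example gradients; the class name, the exponential form of the rate update, and the meta step size below are assumptions, not the paper's specification:

    import numpy as np

    class IncrementalDeltaDelta:
        """Per-weight adaptive step sizes in the spirit of IDD (assumed form).

        Each weight keeps its own learning rate. When successive per-example
        gradients for a weight agree in sign, its rate grows; when they
        disagree, it shrinks. Unlike IDBD, no trace of past gradients is kept,
        only the previous gradient, so nothing ties the rule to a particular
        function approximation architecture."""

        def __init__(self, n_weights, alpha0=0.01, meta=0.05):
            self.alpha = np.full(n_weights, alpha0)
            self.meta = meta
            self.prev_grad = np.zeros(n_weights)

        def step(self, w, grad):
            # Multiplicative rate update keeps every step size positive; the
            # product grad * prev_grad is positive exactly when the current
            # and previous gradients for that weight agree in sign.
            self.alpha *= np.exp(self.meta * np.sign(grad * self.prev_grad))
            self.prev_grad = grad.copy()
            return w - self.alpha * grad

In the experiments described above, a scheme of this kind would replace the single global backpropagation rate when applying the gradient of the residual update.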