Direct Gradient-Based Reinforcement Learning: II. Gradient Ascent Algorithms and Experiments

In the companion paper [13] we introduced GPOMDP, an algorithm for computing arbitrarily accurate approximations to the performance gradient of parameterized partially observable Markov decision processes (POMDPs). The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter, which has a natural interpretation in terms of a bias-variance trade-off, and it requires no knowledge of the underlying state. In addition, the algorithm can be applied to infinite state, control, and observation spaces. In this paper we present CONJPOMDP, a conjugate-gradient ascent algorithm that uses GPOMDP as a subroutine to estimate the gradient direction. CONJPOMDP uses a novel line-search routine that relies solely on gradient estimates and hence is robust to noise in the performance estimates. OLPOMDP, an on-line gradient ascent algorithm based on GPOMDP, is also presented. The chief theoretical advantage of this gradient-based approach over value-function-based approaches to reinforcement learning is that it guarantees improvement in the performance of the policy at every step. To show that this advantage …
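
To make the on-line variant concrete, the following is a minimal sketch (in Python with NumPy) of the kind of eligibility-trace update OLPOMDP performs. The names used here (`softmax_policy`, `grad_log_prob`, `olpomdp_step`, `beta`, `step`) are ours for illustration, and the linear-softmax policy is an assumed example; the definitive statement of the algorithm is in the paper itself. The single free parameter `beta` plays the role of the bias-variance trade-off parameter mentioned above, and the policy is conditioned only on the observation, so no knowledge of the underlying state is required.

```python
import numpy as np

def softmax_policy(theta, obs):
    """Action probabilities of a linear-softmax policy over a discrete action set.

    theta has shape (num_actions, obs_dim); obs has shape (obs_dim,).
    """
    logits = theta @ obs
    logits = logits - logits.max()           # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_prob(theta, obs, action, probs):
    """Gradient of log pi_theta(action | obs) for the linear-softmax policy."""
    g = -np.outer(probs, obs)                # minus expected feature outer product
    g[action] += obs                         # plus features of the chosen action
    return g

def olpomdp_step(theta, z, obs, action, reward, probs, beta=0.9, step=0.01):
    """One on-line policy-gradient update in the spirit of OLPOMDP (sketch only).

    z is the eligibility trace; beta in [0, 1) discounts it and controls the
    bias-variance trade-off; step is the gradient-ascent step size.
    """
    z = beta * z + grad_log_prob(theta, obs, action, probs)   # update the trace
    theta = theta + step * reward * z                         # ascend the estimated gradient
    return theta, z
```

Inside an agent-environment loop one would sample an action from `softmax_policy(theta, obs)`, observe the resulting reward, and call `olpomdp_step`. A batch wrapper in the style of CONJPOMDP would instead accumulate such trace-weighted rewards over a long sample path to form a gradient estimate, and feed successive estimates to a line search that uses only gradient information, as described in the abstract above.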

[1] Arthur L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM J. Res. Dev., 1967.

[2] G. Tesauro. Practical Issues in Temporal Difference Learning. 1992.

[3] Gerald Tesauro. TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play. Neural Computation, 1994.

[4] Wei Zhang et al. A Reinforcement Learning Approach to Job-Shop Scheduling. IJCAI, 1995.

[5] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[6] Dimitri P. Bertsekas et al. Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems. NIPS, 1996.

[7] Shigenobu Kobayashi et al. Reinforcement Learning in POMDPs with Function Approximation. ICML, 1997.

[8] Xi-Ren Cao et al. Algorithms for Sensitivity Analysis of Markov Systems through Potentials and Perturbation Realization. IEEE Trans. Control Syst. Technol., 1998.

[9] P. Marbach. Simulation-Based Methods for Markov Decision Processes. 1998.

[10] John N. Tsitsiklis et al. Simulation-Based Optimization of Markov Reward Processes. Proceedings of the 37th IEEE Conference on Decision and Control, 1998.

[11] Terrence L. Fine. Feedforward Neural Network Methodology. Information Science and Statistics, 1999.

[12] Jonathan Baxter et al. Reinforcement Learning From State and Temporal Differences. 1999.

[13] J. Baxter and P. L. Bartlett. Direct Gradient-Based Reinforcement Learning: I. Gradient Estimation Algorithms. 1999.

[14] J. Baxter et al. Direct Gradient-Based Reinforcement Learning. Proceedings of the 2000 IEEE International Symposium on Circuits and Systems (ISCAS), 2000.