The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

A number of reinforcement learning algorithms learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance-minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
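To make the baseline idea concrete, here is a minimal Python sketch of a REINFORCE-style (likelihood-ratio) policy-gradient update on a toy two-armed bandit, with the constant baseline tracked as a running average of observed reward, the variance-minimizing choice in the zero-bias limit described in the abstract. The bandit setup, step sizes, and all names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-armed bandit: arm 0 pays 1.0, arm 1 pays 0.2 in expectation.
TRUE_MEANS = np.array([1.0, 0.2])

theta = np.zeros(2)        # softmax policy preferences
baseline = 0.0             # running estimate of average reward
alpha, beta = 0.05, 0.01   # policy and baseline step sizes (assumed values)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(5000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = TRUE_MEANS[a] + rng.normal(scale=0.5)  # noisy reward sample

    # Score-function gradient: d/dtheta log pi(a|theta) = one_hot(a) - probs
    # for a softmax policy.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0

    # Baseline-corrected update. Subtracting a constant b leaves the
    # estimator unbiased (E[grad_log_pi] = 0 under the policy itself);
    # it only changes the variance of the update.
    theta += alpha * (r - baseline) * grad_log_pi

    # Track average reward as the (near-optimal) constant baseline.
    baseline += beta * (r - baseline)

print("learned action probabilities:", softmax(theta))
```

Because the expected score is zero under the policy's own action distribution, any constant baseline preserves unbiasedness; the paper's contribution is identifying which constant minimizes the variance.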
