A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement-learning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MDPs, including Q-learning, SARSA, and advantage learning. In addition to these value-based algorithms, the rule also generates pure policy-search reinforcement-learning algorithms, which learn optimal policies without learning a value function. It further allows policy-search and value-based algorithms to be combined, unifying two very different approaches to reinforcement learning into a single Value and Policy Search (VAPS) algorithm. These algorithms also converge for POMDPs without requiring a proper belief state. Simulation results are given, and several areas for future research are discussed.
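As a rough illustration of how such a combined value-and-policy-search update can look, the sketch below applies a VAPS-style stochastic-gradient step to a tabular Q-function with a Boltzmann policy. This is a minimal sketch under assumptions introduced here, not the paper's implementation: the blending weight `beta`, the `trace` array, and all helper names are illustrative, and the value-based term uses a residual-gradient form of the SARSA error.

```python
import numpy as np

def boltzmann(q_row, temp=1.0):
    """Action probabilities for one observation via a Boltzmann policy."""
    z = (q_row - q_row.max()) / temp
    p = np.exp(z)
    return p / p.sum()

def vaps_step(Q, trace, s, a, r, s2, a2,
              alpha=0.1, gamma=0.9, beta=0.5, temp=1.0):
    """One VAPS-style update on a tabular Q (shape [n_states, n_actions]).

    Performs gradient descent on an immediate error
        e = (1 - beta) * 0.5 * delta**2 + beta * (-r),
    where delta is the SARSA error, plus a term e * trace that credits
    the log-probabilities of the actions taken so far (the policy-search
    component). `trace` has the same shape as Q.
    """
    delta = r + gamma * Q[s2, a2] - Q[s, a]
    e = (1.0 - beta) * 0.5 * delta**2 + beta * (-r)

    # Gradient of the value-based error term in residual form: both the
    # predecessor and successor Q-values are treated as functions of Q.
    grad_e = np.zeros_like(Q)
    grad_e[s, a] += (1.0 - beta) * (-delta)
    grad_e[s2, a2] += (1.0 - beta) * gamma * delta

    # Accumulate d ln pi(a|s) / dQ for the Boltzmann policy:
    # 1/temp on the chosen action, minus pi(b|s)/temp on every action b.
    p = boltzmann(Q[s], temp)
    trace[s] -= p / temp
    trace[s, a] += 1.0 / temp

    # Combined update: immediate-error gradient plus policy-search credit.
    Q -= alpha * (grad_e + e * trace)
    return Q, trace
```

With beta = 0 this reduces to a residual form of SARSA, with beta = 1 it performs pure policy search on immediate reward, and intermediate values blend the two; `trace` would be reset to zero at the start of each episode.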