A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement-learning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MDPs, including Q-learning, SARSA, and advantage learning. In addition to these value-based algorithms, the rule also generates pure policy-search reinforcement-learning algorithms, which learn optimal policies without learning a value function. It further allows policy-search and value-based algorithms to be combined, unifying two very different approaches to reinforcement learning into a single Value and Policy Search (VAPS) algorithm. These algorithms also converge for POMDPs without requiring a proper belief state. Simulation results are given, and several areas for future research are discussed.
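As a rough illustration of how such a combined value-and-policy-search update can look, the sketch below applies a VAPS-style stochastic-gradient step to a tabular Q-function with a Boltzmann policy. This is a minimal sketch under assumptions introduced here, not the paper's implementation: the blending weight `beta`, the `trace` array, and all helper names are illustrative, and the value-based term uses a residual-gradient form of the SARSA error.

```python
import numpy as np

def boltzmann(q_row, temp=1.0):
    """Action probabilities for one observation via a Boltzmann policy."""
    z = (q_row - q_row.max()) / temp
    p = np.exp(z)
    return p / p.sum()

def vaps_step(Q, trace, s, a, r, s2, a2,
              alpha=0.1, gamma=0.9, beta=0.5, temp=1.0):
    """One VAPS-style update on a tabular Q (shape [n_states, n_actions]).

    Performs gradient descent on an immediate error
        e = (1 - beta) * 0.5 * delta**2 + beta * (-r),
    where delta is the SARSA error, plus a term e * trace that credits
    the log-probabilities of the actions taken so far (the policy-search
    component). `trace` has the same shape as Q.
    """
    delta = r + gamma * Q[s2, a2] - Q[s, a]
    e = (1.0 - beta) * 0.5 * delta**2 + beta * (-r)

    # Gradient of the value-based error term in residual form: both the
    # predecessor and successor Q-values are treated as functions of Q.
    grad_e = np.zeros_like(Q)
    grad_e[s, a] += (1.0 - beta) * (-delta)
    grad_e[s2, a2] += (1.0 - beta) * gamma * delta

    # Accumulate d ln pi(a|s) / dQ for the Boltzmann policy:
    # 1/temp on the chosen action, minus pi(b|s)/temp on every action b.
    p = boltzmann(Q[s], temp)
    trace[s] -= p / temp
    trace[s, a] += 1.0 / temp

    # Combined update: immediate-error gradient plus policy-search credit.
    Q -= alpha * (grad_e + e * trace)
    return Q, trace
```

With beta = 0 this reduces to a residual form of SARSA, with beta = 1 it performs pure policy search on immediate reward, and intermediate values blend the two; `trace` would be reset to zero at the start of each episode.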