In reinforcement learning, as in many on-line search techniques, a large number of estimation parameters (e.g. Q-value estimates for 1-step Q-learning) are maintained and dynamically updated as information comes to hand during the learning process. Excessive variance of these estimators can be problematic, resulting in uneven or unstable learning, or even making effective learning impossible. Estimator variance is usually managed only indirectly, by selecting global learning algorithm parameters (e.g. λ for TD(λ) based methods) that are a compromise between an acceptable level of estimator perturbation and other desirable system attributes, such as reduced estimator bias. In this paper, we argue that this approach may not always be adequate, particularly for noisy and non-Markovian domains, and present a direct approach to managing estimator variance, the new ccBeta algorithm. Empirical results in an autonomous robotics domain are also presented showing improved performance using the ccBeta method.
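To make the variance issue concrete, the sketch below shows the standard 1-step Q-learning backup referred to in the abstract. The environment, the tabular representation, and the parameter names `alpha` and `gamma` are illustrative assumptions rather than details from the paper; the point is that each Q(s, a) estimate is revised from noisy sampled returns, so the fixed step size `alpha` trades off estimator variance against responsiveness (the trade-off the ccBeta algorithm addresses directly).

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Apply the standard 1-step Q-learning backup to the estimate Q[(s, a)].

    Note: alpha and gamma values are illustrative, not taken from the paper.
    """
    # Greedy bootstrap from the successor state's current value estimates.
    best_next = max(Q[(s_next, a2)] for a2 in actions) if actions else 0.0
    td_target = r + gamma * best_next       # sampled 1-step return (noisy)
    td_error = td_target - Q[(s, a)]        # correction signal
    Q[(s, a)] += alpha * td_error           # larger alpha => higher estimator variance
    return Q[(s, a)]

# Minimal usage example with a hypothetical two-action environment.
Q = defaultdict(float)
actions = [0, 1]
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, actions=actions)
```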