Reinforcement Learning in Situated Agents: Theoretical and Practical Solutions

In on-line reinforcement learning, a large number of estimation parameters (e.g., the Q-value estimates in 1-step Q-learning) are typically maintained and dynamically updated as information becomes available during the learning process. Excessive variance in these estimators can be problematic, resulting in uneven or unstable learning, or even making effective learning impossible. Estimator variance is usually managed only indirectly, by selecting global learning algorithm parameters (e.g., λ for TD(λ)-based methods) that represent a compromise between an acceptable level of estimator perturbation and other desirable system attributes, such as reduced estimator bias. In this paper, we argue that this approach may not always be adequate, particularly in noisy and non-Markovian domains, and present a direct approach to managing estimator variance, the ccBeta algorithm. Empirical results in an autonomous robotics domain are also presented, showing improved performance with the new ccBeta method.
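To make the indirect control of estimator variance concrete, the sketch below shows a standard 1-step Q-learning update in which a single global step size scales how strongly each noisy sample target perturbs a Q-value estimate. The class and parameter names (`QLearner`, `step_size`, etc.) are illustrative assumptions, and the sketch does not implement the ccBeta algorithm itself; it only shows the conventional setup the paper argues against.

```python
# Illustrative sketch only: a conventional 1-step Q-learning update where a
# single global step size indirectly governs estimator variance. Names here
# are assumptions for illustration; this is NOT the ccBeta algorithm.
from collections import defaultdict
import random


class QLearner:
    def __init__(self, actions, step_size=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)   # one Q-value estimate per (state, action)
        self.actions = actions
        self.step_size = step_size    # global learning rate: larger values track
                                      # change faster but raise estimator variance
        self.gamma = gamma
        self.epsilon = epsilon

    def choose(self, state):
        # Epsilon-greedy action selection.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # 1-step Q-learning update of a single estimate. The noisy sample
        # target perturbs the estimate by an amount scaled by the global
        # step size, so variance is controlled only indirectly through that
        # one parameter rather than managed per estimator.
        target = reward + self.gamma * max(
            self.q[(next_state, a)] for a in self.actions
        )
        td_error = target - self.q[(state, action)]
        self.q[(state, action)] += self.step_size * td_error
```

Under this conventional scheme, lowering the step size damps estimator perturbation at the cost of slower adaptation, which is exactly the kind of global compromise the abstract describes.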