Efficient data reuse in value function approximation

Off-policy reinforcement learning aims to make efficient use of data samples gathered from a policy that differs from the currently optimized policy. A common approach is to use importance sampling to compensate for the bias that the mismatch between the data-sampling policy and the target policy introduces into value function estimators. However, existing off-policy methods often do not explicitly take the variance of the value function estimators into account, and their performance therefore tends to be unstable. To cope with this problem, we propose an adaptive importance sampling technique that allows us to actively control the trade-off between bias and variance. We further provide a method for optimally determining the trade-off parameter based on a variant of cross-validation. The usefulness of the proposed approach is demonstrated through a simulated swing-up inverted-pendulum problem.
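
To make the bias-variance mechanism concrete, below is a minimal Python sketch of the flattened importance-weighting idea and of parameter selection by importance-weighted cross-validation. All names here (phi features, pi_t, pi_b, the flattening exponent nu, select_nu_by_iwcv) are illustrative assumptions, not the paper's exact notation or algorithm: it uses per-sample weights and a linear value model fitted by weighted least squares, whereas the paper works with per-trajectory importance weights.

import numpy as np

def flattened_weights(pi_t, pi_b, nu):
    # w^nu: nu=0 ignores the policy mismatch (low variance, biased),
    # nu=1 is full importance weighting (unbiased, high variance).
    return (pi_t / pi_b) ** nu

def fit_value_function(Phi, targets, weights):
    # Importance-weighted least squares for a linear model V(s) = theta^T phi(s):
    # theta = argmin_theta sum_i w_i (phi_i^T theta - y_i)^2, with a small
    # ridge term for numerical stability.
    W = np.diag(weights)
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ W @ Phi + 1e-8 * np.eye(d),
                           Phi.T @ W @ targets)

def select_nu_by_iwcv(Phi, targets, pi_t, pi_b, candidates, k=5, seed=0):
    # Pick the flattening exponent by importance-weighted cross-validation:
    # training uses the flattened weights w^nu, while validation errors are
    # reweighted by the full ratio w so the score approximates the error
    # under the target policy for every candidate nu.
    rng = np.random.default_rng(seed)
    n = len(targets)
    folds = np.array_split(rng.permutation(n), k)
    best_nu, best_err = None, np.inf
    for nu in candidates:
        err = 0.0
        for val in folds:
            train = np.setdiff1d(np.arange(n), val)
            w_train = flattened_weights(pi_t[train], pi_b[train], nu)
            theta = fit_value_function(Phi[train], targets[train], w_train)
            w_val = pi_t[val] / pi_b[val]  # full ratio for validation
            residual = Phi[val] @ theta - targets[val]
            err += np.sum(w_val * residual ** 2)
        if err < best_err:
            best_nu, best_err = nu, err
    return best_nu

Under these assumptions, calling select_nu_by_iwcv with a grid of candidates such as [0.0, 0.25, 0.5, 0.75, 1.0] returns the exponent that balances the bias of ignoring the policy mismatch against the variance of full importance weighting, which is the trade-off the abstract describes.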
