Learning from Scarce Experience

Searching the space of policies directly for the optimal policy has been one popular method for solving partially observable reinforcement learning problems. Typically, with each change of the target policy, its value is estimated from the results of following that very policy. This requires a large number of interactions with the environment as different policies are considered. We present a family of algorithms based on likelihood ratio estimation that use data gathered when executing one policy (or collection of policies) to estimate the value of a different policy. The algorithms combine estimation and optimization stages. The former utilizes experience to build a non-parametric representation of the function being optimized. The latter performs optimization on this estimate. We show positive empirical results and provide a sample complexity bound.

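Since the abstract describes the estimation stage as likelihood ratio (importance sampling) evaluation of one policy from trajectories gathered under another, a minimal sketch may help make that idea concrete. The sketch below is an illustrative assumption, not the paper's implementation: the names (`estimate_value`, `behavior_policy`, `target_policy`) are hypothetical, episodes are assumed to be lists of (observation, action, reward) tuples, and policies are assumed to be memoryless action-probability tables indexed by observation.

```python
import numpy as np

def estimate_value(trajectories, target_policy, behavior_policy):
    """Estimate the expected return of target_policy from episodes
    gathered while executing behavior_policy.

    trajectories: list of episodes, each a list of (obs, action, reward) tuples
    target_policy, behavior_policy: dict mapping obs -> array of action probabilities
    """
    weighted_returns = []
    for episode in trajectories:
        ratio = 1.0   # likelihood ratio of the whole trajectory
        ret = 0.0     # undiscounted return of the episode
        for obs, action, reward in episode:
            # The trajectory likelihood ratio factorizes over time steps because
            # the (unknown) environment transition probabilities cancel out.
            ratio *= target_policy[obs][action] / behavior_policy[obs][action]
            ret += reward
        weighted_returns.append(ratio * ret)
    # Ordinary importance-sampling estimator; a normalized (weighted) variant
    # would divide by the sum of the ratios instead of the number of episodes.
    return float(np.mean(weighted_returns))
```

In this reading, the optimization stage would simply maximize such an estimate over a parameterized family of target policies, reusing the same stored trajectories rather than gathering fresh experience for each candidate policy.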