Error Reducing Sampling in Reinforcement Learning

In reinforcement learning, an agent collects information by interacting with an environment and uses it to derive a behavior. This paper focuses on efficient sampling; that is, the problem of choosing the interaction samples so that the corresponding behavior converges quickly to the optimal behavior. Our main result is a sensitivity analysis relating the choice of sampling any state-action pair to the decrease of an error bound on the optimal solution. From it we derive two new model-based algorithms. Simulations demonstrate faster convergence (in terms of the number of samples) of the estimated value function to the true optimal value function.
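The abstract does not spell out the algorithms, but the idea it describes, sampling the state-action pair whose error bound would shrink the most, can be illustrated with a minimal model-based sketch. Everything below is a hypothetical reconstruction, not the paper's method: it uses a Hoeffding-style confidence width `1/sqrt(n(s,a)+1)` as a stand-in for the paper's error bound, a small randomly generated MDP as the environment, and value iteration on the empirical model for planning.

```python
import math
import random

def plan(model_p, model_r, n_states, n_actions, gamma, iters=200):
    """Value iteration on the empirical (estimated) MDP model."""
    v = [0.0] * n_states
    for _ in range(iters):
        v = [max(model_r[(s, a)]
                 + gamma * sum(p * v[s2] for s2, p in model_p[(s, a)].items())
                 for a in range(n_actions))
             for s in range(n_states)]
    return v

def error_guided_sampling(n_states=4, n_actions=2, gamma=0.9, budget=300, seed=0):
    """Sketch of error-bound-guided sampling (hypothetical, not the paper's algorithm)."""
    rng = random.Random(seed)
    # Hidden "true" MDP, used only to generate samples: each (s, a) picks
    # uniformly between two possible next states and yields a fixed reward.
    pairs = [(s, a) for s in range(n_states) for a in range(n_actions)]
    true_next = {sa: [rng.randrange(n_states), rng.randrange(n_states)] for sa in pairs}
    true_r = {sa: rng.random() for sa in pairs}

    counts = {sa: 0 for sa in pairs}
    trans = {sa: {} for sa in pairs}      # empirical transition counts
    rew_sum = {sa: 0.0 for sa in pairs}   # accumulated rewards

    for _ in range(budget):
        # Hoeffding-style width: less-sampled pairs have a larger confidence
        # interval, so sampling them decreases the overall error bound most.
        sa = max(pairs, key=lambda k: 1.0 / math.sqrt(counts[k] + 1))
        s2 = rng.choice(true_next[sa])
        counts[sa] += 1
        trans[sa][s2] = trans[sa].get(s2, 0) + 1
        rew_sum[sa] += true_r[sa]

    # Build the empirical model; the width rule guarantees every pair is visited.
    model_p = {sa: {s2: c / counts[sa] for s2, c in trans[sa].items()} for sa in pairs}
    model_r = {sa: rew_sum[sa] / counts[sa] for sa in pairs}
    return plan(model_p, model_r, n_states, n_actions, gamma), counts
```

With only counts driving the bound, this degenerates to round-robin sampling; the paper's contribution is precisely a sharper criterion, a sensitivity analysis that weighs each pair's influence on the error of the optimal solution rather than its visit count alone.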
