Paper information - Fitted Q-iteration in continuous action-space MDPs

We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems.
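The variant described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the one-dimensional linear dynamics, the quadratic reward, the quadratic feature map for Q, the finite grid of linear policies `a = theta * s`, and all function names. The sketch alternates two steps on a single trajectory: regress Q onto one-step targets that use the current policy's action at the next state, then pick the candidate policy that maximizes the average fitted action value over the observed states.

```python
import random

GAMMA = 0.9  # discount factor (assumed)

def features(s, a):
    # Quadratic features in (state, action); an assumed function class.
    return [1.0, s * s, s * a, a * a]

def q_value(w, s, a):
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def least_squares(X, y, ridge=1e-8):
    # Solve the (slightly regularized) normal equations X^T X w = X^T y.
    n = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) + (ridge if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(n)]
    return solve(A, b)

def collect_trajectory(n_steps, rng):
    # One trajectory from a uniformly random behavior policy (toy dynamics).
    s, data = 0.0, []
    for _ in range(n_steps):
        a = rng.uniform(-1.0, 1.0)
        r = -(s * s) - 0.1 * a * a
        s_next = 0.9 * s + 0.5 * a + rng.gauss(0.0, 0.05)
        data.append((s, a, r, s_next))
        s = s_next
    return data

def fitted_q_iteration(data, thetas, n_iters=20):
    w = [0.0] * 4          # Q_0 = 0
    theta = thetas[0]      # arbitrary initial policy from the candidate set
    X = [features(s, a) for s, a, _, _ in data]
    for _ in range(n_iters):
        # Regression targets use the current policy's action at the next state.
        y = [r + GAMMA * q_value(w, s2, theta * s2) for _, _, r, s2 in data]
        w = least_squares(X, y)
        # Policy search: maximize the average fitted action value over the data.
        theta = max(thetas,
                    key=lambda th: sum(q_value(w, s, th * s)
                                       for s, _, _, _ in data))
    return w, theta

rng = random.Random(0)
data = collect_trajectory(500, rng)
thetas = [(i - 20) / 10 for i in range(31)]  # candidate gains from -2.0 to 1.0
w, theta = fitted_q_iteration(data, thetas)
print("selected policy gain:", theta)
```

The policy search step replaces the usual greedy maximization over actions, which is impractical when the action space is continuous, with a maximization over a restricted policy set. In this toy setting the set is a small grid, so exhaustive search suffices; a richer policy class would need a proper optimizer.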
