Paper Information - Scaling Up Approximate Value Iteration with Options: Better Policies with Fewer Iterations

We show how options, a class of control structures encompassing primitive and temporally extended actions, can play a valuable role in planning in MDPs with continuous state-spaces. Analyzing the convergence rate of Approximate Value Iteration with options reveals that for pessimistic initial value function estimates, options can speed up convergence compared to planning with only primitive actions even when the temporally extended actions are suboptimal and sparsely scattered throughout the state-space. Our experimental results in an optimal replacement task and a complex inventory management task demonstrate the potential for options to speed up convergence in practice. We show that options induce faster convergence to the optimal value function, which implies deriving better policies with fewer iterations.
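The intuition behind the convergence claim can be illustrated with a toy example. The sketch below is not the paper's algorithm or experiments: it assumes a small deterministic chain MDP with a tabular (exact) value function, a reward of -1 per step until an absorbing goal, and a single hypothetical option that moves right for up to `option_len` steps. All names (`models`, `iterations_to_converge`, `option_len`) are illustrative assumptions. Starting from a pessimistic initial value, it counts how many synchronous value-iteration sweeps are needed with and without the option.

```python
def optimal_values(n, gamma):
    # Closed form for the chain: state s needs (n - 1 - s) steps at reward -1 each.
    return [-(1 - gamma ** (n - 1 - s)) / (1 - gamma) for s in range(n)]


def models(n, gamma, option_len):
    """For each non-goal state, list (reward, discount, next_state) triples.

    Primitive actions (left, right) last one step. If option_len > 1, a
    temporally extended "move right" option is added, summarized by its
    cumulative discounted reward and its multi-step discount gamma ** d.
    """
    out = []
    for s in range(n - 1):
        acts = [(-1.0, gamma, max(s - 1, 0)), (-1.0, gamma, s + 1)]
        if option_len > 1:
            d = min(option_len, n - 1 - s)  # option stops at the goal
            acts.append((-(1 - gamma ** d) / (1 - gamma), gamma ** d, s + d))
        out.append(acts)
    return out


def iterations_to_converge(n, gamma, option_len, tol=1e-9):
    v_star = optimal_values(n, gamma)
    # Pessimistic initialization: lower bound -1 / (1 - gamma); goal value is 0.
    v = [-1.0 / (1 - gamma)] * (n - 1) + [0.0]
    m = models(n, gamma, option_len)
    sweeps = 0
    while max(abs(a - b) for a, b in zip(v, v_star)) > tol:
        # Synchronous Bellman backup over primitives and (optionally) the option.
        v = [max(r + g * v[s2] for r, g, s2 in m[s]) for s in range(n - 1)] + [0.0]
        sweeps += 1
    return sweeps


primitives_only = iterations_to_converge(21, 0.95, option_len=1)
with_option = iterations_to_converge(21, 0.95, option_len=5)
print(primitives_only, with_option)
```

In this toy setting, correct values spread backward from the goal one state per sweep with primitive actions only, and up to `option_len` states per sweep once the option is available. The paper's setting (continuous state-spaces, approximate backups, suboptimal and sparsely available options) is more general than this sketch.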

Timothy A. Mann | Shie Mannor
