Approximate Value Iteration with Temporally Extended Actions

Temporally extended actions have proven useful for reinforcement learning, and their duration also makes them valuable for efficient planning. The options framework provides a concrete way to implement and reason about temporally extended actions. Existing work has demonstrated the value of planning with options empirically, but theoretical analysis formalizing when planning with options is more efficient than planning with primitive actions has been lacking. We provide a general analysis of the convergence rate of a popular Approximate Value Iteration (AVI) algorithm, Fitted Value Iteration (FVI), when it plans with options. Our analysis reveals that longer-duration options and a pessimistic estimate of the value function both lead to faster convergence. Furthermore, options can improve convergence even when they are suboptimal and sparsely distributed throughout the state space. Next, we consider the problem of generating useful options for planning from a subset of landmark states. This leads to a new algorithm, Landmark-based AVI (LAVI), which represents the value function only at the landmark states. We analyze both FVI and LAVI with the proposed landmark-based options and compare the two algorithms. Experimental results in three different domains illustrate the key properties predicted by the analysis. Together, our theoretical and experimental results demonstrate that options can play an important role in AVI by decreasing approximation error and inducing fast convergence.
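
For concreteness, the sketch below illustrates the kind of SMDP-style backup that value iteration with options performs: each option is evaluated through the discounted reward it accumulates while running plus the discounted value of the state where it terminates, and a function approximator is then refit to the resulting targets. This is a minimal illustrative sketch, not the authors' implementation; the simulator interface (sample_option_outcome), the option set, the pessimistic zero initialization, and the choice of a k-nearest-neighbor regressor are assumptions introduced here.

```python
"""Illustrative sketch of Fitted Value Iteration (FVI) with options.

Hypothetical interface: `sample_option_outcome(state, option)` returns the
cumulative discounted reward earned while the option runs, the state where
it terminates, and its duration. None of these names come from the paper.
"""
import numpy as np
from sklearn.neighbors import KNeighborsRegressor


def fitted_value_iteration_with_options(
    sample_states,          # (n, d) array of states used as the fitting set
    options,                # list of options (may also include primitive actions)
    sample_option_outcome,  # (state, option) -> (discounted_reward, next_state, duration)
    gamma=0.99,
    n_iterations=50,
):
    """Approximate value iteration with multi-step (option) backups."""
    # Pessimistic (zero) initial value estimate, in the spirit of the analysis
    # suggesting that pessimistic initialization speeds convergence.
    regressor = KNeighborsRegressor(n_neighbors=5)
    regressor.fit(sample_states, np.zeros(len(sample_states)))

    for _ in range(n_iterations):
        targets = np.empty(len(sample_states))
        for i, s in enumerate(sample_states):
            backups = []
            for o in options:
                # SMDP backup: reward accumulated while the option runs, plus
                # the value of the termination state discounted by the duration.
                r, s_next, tau = sample_option_outcome(s, o)
                v_next = regressor.predict(s_next.reshape(1, -1))[0]
                backups.append(r + (gamma ** tau) * v_next)
            targets[i] = max(backups)
        # Refit the function approximator to the new backup targets.
        regressor.fit(sample_states, targets)

    return regressor
```

Restricting `sample_states` to a small set of landmark states would correspond, roughly, to the landmark-based variant (LAVI) described above, in which the value function is represented only at those landmarks.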
