Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition

This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics, as a subroutine hierarchy, and a declarative semantics, as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, by Kaelbling, and by Dayan and Hinton. It rests on the assumption that the programmer can identify useful subgoals and define subtasks that achieve them. By defining such subgoals, the programmer constrains the set of policies that must be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstraction, so that individual MDPs within the hierarchy can ignore large parts of the state space; this is crucial for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. It then presents an online, model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this non-hierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.
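To make the additive decomposition concrete, the sketch below renders it in Python. MAXQ expresses the value of invoking a child action a inside subtask i as Q(i, s, a) = V(a, s) + C(i, s, a), where V(a, s) is the expected reward accumulated while a executes and C(i, s, a), the completion function, is the expected discounted reward for finishing subtask i afterward. This is a minimal illustration only, assuming discrete states and a hand-built hierarchy; the class and method names are invented for this sketch, and the update shown is the basic MAXQ-Q completion rule without pseudo-rewards.

    from collections import defaultdict

    class MAXQSketch:
        """Illustrative MAXQ decomposition; not the paper's implementation."""

        def __init__(self, hierarchy, primitives, gamma=0.9, alpha=0.1):
            self.children = hierarchy          # composite subtask -> list of child actions
            self.primitives = set(primitives)  # leaf actions executed directly in the MDP
            self.gamma, self.alpha = gamma, alpha
            self.V = defaultdict(float)        # V[(primitive, state)]: one-step reward estimate
            self.C = defaultdict(float)        # C[(subtask, state, child)]: completion function

        def q(self, i, s, a):
            # The additive decomposition: Q(i, s, a) = V(a, s) + C(i, s, a).
            return self.value(a, s) + self.C[(i, s, a)]

        def value(self, i, s):
            # V(i, s) is a learned one-step reward for primitive actions and a
            # recursive max over child Q values for composite subtasks.
            if i in self.primitives:
                return self.V[(i, s)]
            return max(self.q(i, s, a) for a in self.children[i])

        def update_completion(self, i, s, a, s_next, n_steps):
            # MAXQ-Q update after child a ran for n_steps and ended in s_next:
            # C(i, s, a) <- C(i, s, a) + alpha * (gamma^n * max_a' Q(i, s_next, a') - C(i, s, a)).
            best = max(self.q(i, s_next, ap) for ap in self.children[i])
            key = (i, s, a)
            self.C[key] += self.alpha * (self.gamma ** n_steps * best - self.C[key])

Because every Q value is reconstructed recursively from V and C terms along a path through the hierarchy, the same learned quantities support the non-hierarchical execution described above: at each state, one can greedily follow the primitive action that maximizes the recursively computed Q values rather than committing to a subtask until it terminates.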

[1] Ronald A. Howard. Dynamic Programming and Markov Processes, 1960.

[2] Richard E. Fikes, Peter E. Hart, and Nils J. Nilsson. Learning and Executing Generalized Robot Plans, 1972, Artif. Intell.

[3] Earl D. Sacerdoti. Planning in a Hierarchy of Abstraction Spaces, 1974, IJCAI.

[4] Charles L. Forgy. Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem, 1982, Artif. Intell.

[5] Richard E. Korf. Macro-Operators: A Weak Method for Learning, 1985, Artif. Intell.

[6] John E. Laird, Allen Newell, and Paul S. Rosenbloom. SOAR: An Architecture for General Intelligence, 1987, Artif. Intell.

[7] Christopher J. C. H. Watkins. Learning from Delayed Rewards, PhD thesis, 1989.

[8] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1988, Morgan Kaufmann Series in Representation and Reasoning.

[9] Craig A. Knoblock. Learning Abstraction Hierarchies for Problem Solving, 1990, AAAI.

[10] Ken Currie and Austin Tate. O-Plan: The Open Planning Architecture, 1991, Artif. Intell.

[11] Peter Dayan and Geoffrey E. Hinton. Feudal Reinforcement Learning, 1992, NIPS.

[12] Long-Ji Lin. Reinforcement Learning for Robots Using Neural Networks, PhD thesis, 1992.

[13] Michael I. Jordan et al. Technical report, MIT Artificial Intelligence Laboratory and Center for Biological and Computational Learning, 1996.

[14] Milind Tambe et al. Investigating Production System Representations for Non-Combinatorial Match, 1994, Artif. Intell.

[15] G. A. Rummery and Mahesan Niranjan. On-line Q-Learning Using Connectionist Systems, 1994.

[16] Richard S. Sutton. TD Models: Modeling the World at a Mixture of Time Scales, 1995, ICML.

[17] Thomas Dean and Shieu-Hong Lin. Decomposition Techniques for Planning in Stochastic Domains, 1995, IJCAI.

[18] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[19] Craig Boutilier et al. Exploiting Structure in Policy Construction, 1995, IJCAI.

[20] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to Act Using Real-Time Dynamic Programming, 1995, Artif. Intell.

[21] Leemon C. Baird. Residual Algorithms: Reinforcement Learning with Function Approximation, 1995, ICML.

[22] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming, 1996, Athena Scientific.

[23] Craig Boutilier and Richard Dearden. Approximate Value Trees in Structured Dynamic Programming, 1996, ICML.

[24] Thomas G. Dietterich and Nicholas S. Flann. Hierarchical Explanation-Based Reinforcement Learning, 1997, ICML.

[25] Craig Boutilier et al. Economic Principles of Multi-Agent Systems, 1997, Artif. Intell.

[26] Ronald Parr and Stuart J. Russell. Reinforcement Learning with Hierarchies of Machines, 1997, NIPS.

[27] Doina Precup and Richard S. Sutton. Multi-time Models for Temporally Abstract Planning, 1997, NIPS.

[28] Csaba Szepesvári et al. Module Based Reinforcement Learning for a Real Robot, 1997.

[29] Milos Hauskrecht et al. Hierarchical Solution of Markov Decision Processes Using Macro-actions, 1998, UAI.

[30] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and Semi-MDPs: Learning, Planning, and Representing Knowledge at Multiple Temporal Scales, 1998.

[31] Ronald E. Parr. Hierarchical Control and Learning for Markov Decision Processes, PhD thesis, 1998.

[32] Thomas G. Dietterich. The MAXQ Method for Hierarchical Reinforcement Learning, 1998, ICML.

[33] Ronald Parr. Flexible Decomposition Algorithms for Weakly Coupled Markov Decision Problems, 1998, UAI.

[34] Balaraman Ravindran et al. Improved Switching among Temporally Abstract Actions, 1998, NIPS.

[35] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-Theoretic Planning: Structural Assumptions and Computational Leverage, 1999, J. Artif. Intell. Res.

[36] Andrew W. Moore et al. Multi-Value-Functions: Efficient Automatic Action Hierarchies for Multiple Goal MDPs, 1999, IJCAI.