HQ-Learning: Discovering Markovian Subgoals for Non-Markovian Reinforcement Learning

To solve partially observable Markov decision problems, we introduce HQ-learning, a hierarchical extension of Q-learning. HQ-learning is based on an ordered sequence of subagents, each learning to identify and solve a Markovian subtask of the total task. Each subagent learns (1) an appropriate subgoal (even though there is no intermediate, external reinforcement for "good" subgoals), and (2) a Markovian policy for reaching that subgoal. Our experiments demonstrate: (a) The system can easily solve tasks that standard Q-learning cannot solve at all. (b) It can solve partially observable mazes with more states than those used in most previous POMDP work. (c) It can quickly solve complex tasks that require manipulation of the environment to free a blocked path to the goal.
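To make the architecture described above concrete, the following is a minimal sketch, not the authors' reference implementation: an ordered chain of subagents, each holding a tabular Q-function for its Markovian sub-policy and an HQ-table over candidate subgoal observations. Class and function names such as `SubAgent` and `run_episode` are hypothetical, and the update rules are plain tabular Q-learning plus a simple return-based HQ update, standing in for the paper's exact learning rules.

```python
import random

class SubAgent:
    """One subagent in the ordered chain (hypothetical sketch, not the paper's code)."""

    def __init__(self, n_obs, n_actions, alpha=0.1, gamma=0.9):
        self.q = [[0.0] * n_actions for _ in range(n_obs)]  # Q(o, a): sub-policy values
        self.hq = [0.0] * n_obs                              # HQ(o): value of choosing o as subgoal
        self.alpha, self.gamma = alpha, gamma

    def pick_subgoal(self, epsilon=0.1):
        # epsilon-greedy choice of the observation this agent treats as its subgoal
        if random.random() < epsilon:
            return random.randrange(len(self.hq))
        return max(range(len(self.hq)), key=self.hq.__getitem__)

    def act(self, obs, epsilon=0.1):
        # epsilon-greedy action selection from the agent's own Q-table
        if random.random() < epsilon:
            return random.randrange(len(self.q[obs]))
        return max(range(len(self.q[obs])), key=self.q[obs].__getitem__)

    def update_q(self, obs, action, reward, next_obs, done):
        # standard one-step tabular Q-learning update
        target = reward if done else reward + self.gamma * max(self.q[next_obs])
        self.q[obs][action] += self.alpha * (target - self.q[obs][action])

    def update_hq(self, subgoal, episode_return):
        # move the chosen subgoal's HQ value toward the return observed after choosing it
        # (a simplified stand-in for the paper's HQ update)
        self.hq[subgoal] += self.alpha * (episode_return - self.hq[subgoal])


def run_episode(env, agents, max_steps=200):
    """Run the ordered agent chain: each agent acts until its chosen subgoal
    observation appears, then hands control to the next agent in the sequence."""
    obs = env.reset()
    idx, total = 0, 0.0
    subgoals = [a.pick_subgoal() for a in agents]
    for _ in range(max_steps):
        agent = agents[idx]
        action = agent.act(obs)
        next_obs, reward, done = env.step(action)   # assumed env interface: reset()/step()
        agent.update_q(obs, action, reward, next_obs, done)
        total += reward
        obs = next_obs
        if done:
            break
        if obs == subgoals[idx] and idx + 1 < len(agents):
            idx += 1                                 # subgoal reached: pass control on
    for a, g in zip(agents, subgoals):
        a.update_hq(g, total)
    return total
```

The key design point mirrored here is that only the currently active subagent learns and acts, and control passes strictly in order once a subgoal observation is reached, so each subagent only ever faces the (approximately Markovian) slice of the task between two subgoals.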