Learning When to Switch between Skills in a High Dimensional Domain

Introduction

Complex problems are often easier to model with skills than with primitive (single time-step) actions (Stone, Sutton, and Kuhlmann 2005). In addition, skills have been shown to speed up the convergence rates of planning algorithms both experimentally (Sutton, Precup, and Singh 1999; Silver and Ciosek 2012) and theoretically (Mann and Mannor 2014). Skills are generally designed by a domain expert, but designing a ‘good’ set of skills can be challenging in high-dimensional, complex domains. In some cases, the skills may contain useful prior knowledge but cannot solve the task, resulting in a sub-optimal solution or no solution at all. Given a ‘poor’ set of skills, we would like to improve them dynamically. Sutton, Precup, and Singh (1999) suggest Interrupting Options (IO), whereby the agent switches skills whenever another skill has a higher value than continuing the current one. Mankowitz, Mann, and Mannor (2014) prove that the IO process converges on the set of skills with optimal switching rules (under mild assumptions). While the potential of IO is promising, previous experiments have demonstrated its advantage only in tasks with few states.

We experiment with Space Invaders (SI, Figure 1) via the Arcade Learning Environment (ALE), using the 1024-bit RAM image as the state (Bellemare et al. 2013). The primitive actions are combinations of move left, move right, do nothing, and shoot. Our hypothesis is that IO can improve performance compared to learning with a fixed set of skills, despite the fact that SI has a large state space. To test our hypothesis, we constructed a naive set of skills for the domain, where each skill repeats one of the primitive actions until a termination condition occurs. However, designing a rule that determines the optimal time to switch between skills is challenging, so we gave each skill a constant probability of switching at each time step.
We call the set of skills with this naive termination rule the initial skill set, and our experiments compare learning with the initial skill set to learning with a skill set whose switching rules are dynamically adapted by the IO rule.
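The two termination rules described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the switching probability, and the toy Q-values are all assumptions made for the example. The IO rule interrupts the current skill whenever some other skill's value at the current state exceeds the value of continuing.

```python
import random

P_SWITCH = 0.1  # naive rule: fixed per-step switching probability (illustrative value)

def naive_terminate(rng):
    """Initial skill set: terminate with a constant probability at each time step."""
    return rng.random() < P_SWITCH

def io_terminate(q, state, current_skill):
    """IO rule: interrupt whenever another skill's value exceeds the value
    of continuing the current skill in this state."""
    continue_value = q[(state, current_skill)]
    best_value = max(v for (s, _), v in q.items() if s == state)
    return best_value > continue_value

# Toy Q-table: two states, three skills (values are made up for illustration).
q = {
    ("s0", "left"): 0.2, ("s0", "right"): 0.5, ("s0", "shoot"): 0.9,
    ("s1", "left"): 0.7, ("s1", "right"): 0.1, ("s1", "shoot"): 0.3,
}

# In s0 while executing "right", IO interrupts: "shoot" is worth more.
print(io_terminate(q, "s0", "right"))  # True
# In s1 while executing "left", no skill beats continuing, so IO does not switch.
print(io_terminate(q, "s1", "left"))   # False
```

The contrast is that `naive_terminate` ignores the state entirely, while `io_terminate` switches exactly when doing so is predicted to be valuable, which is the behavior the IO convergence result relies on.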