Learning When to Switch between Skills in a High Dimensional Domain

Introduction

Complex problems are often easier to model with skills than with primitive (single time-step) actions (Stone, Sutton, and Kuhlmann 2005). In addition, skills have been shown to speed up the convergence rates of planning algorithms both experimentally (Sutton, Precup, and Singh 1999; Silver and Ciosek 2012) and theoretically (Mann and Mannor 2014). Skills are generally designed by a domain expert, but designing a ‘good’ set of skills can be challenging in high-dimensional, complex domains. In some cases, the skills may contain useful prior knowledge but cannot solve the task, resulting in a sub-optimal solution or no solution at all. Given a ‘poor’ set of skills, we would like to improve them dynamically. Sutton, Precup, and Singh (1999) suggest Interrupting Options (IO), whereby the agent switches skills whenever another skill has a higher value than continuing the current one. Mankowitz, Mann, and Mannor (2014) prove that the IO process converges on the set of skills with optimal switching rules (under mild assumptions). While the potential of IO is promising, previous experiments have demonstrated its advantage only in tasks with few states.

We experiment with Space Invaders (SI, Figure 1) via the Arcade Learning Environment (ALE), using the 1024-bit RAM image as the state (Bellemare et al. 2013). The primitive actions are combinations of move left, move right, do nothing, and shoot. Our hypothesis is that IO can improve performance compared to learning with a fixed set of skills, despite the fact that SI has a large state space. To test our hypothesis, we constructed a naive set of skills for the domain, where each skill repeats one of the primitive actions until a termination condition occurs. However, designing a rule that determines the optimal time to switch between skills is challenging, so we gave each skill a constant probability of switching at each time step.
We call the set of skills with this naive termination rule the initial skill set, and our experiments compare learning with the initial skill set to learning with a skill set whose switching rules are dynamically adapted by the IO rule.
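The two termination rules described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the switching probability, and the toy Q-values are all assumptions made for the example. The IO rule interrupts the current skill whenever some other skill's value at the current state exceeds the value of continuing.

```python
import random

P_SWITCH = 0.1  # naive rule: fixed per-step switching probability (illustrative value)

def naive_terminate(rng):
    """Initial skill set: terminate with a constant probability at each time step."""
    return rng.random() < P_SWITCH

def io_terminate(q, state, current_skill):
    """IO rule: interrupt whenever another skill's value exceeds the value
    of continuing the current skill in this state."""
    continue_value = q[(state, current_skill)]
    best_value = max(v for (s, _), v in q.items() if s == state)
    return best_value > continue_value

# Toy Q-table: two states, three skills (values are made up for illustration).
q = {
    ("s0", "left"): 0.2, ("s0", "right"): 0.5, ("s0", "shoot"): 0.9,
    ("s1", "left"): 0.7, ("s1", "right"): 0.1, ("s1", "shoot"): 0.3,
}

# In s0 while executing "right", IO interrupts: "shoot" is worth more.
print(io_terminate(q, "s0", "right"))  # True
# In s1 while executing "left", no skill beats continuing, so IO does not switch.
print(io_terminate(q, "s1", "left"))   # False
```

The contrast is that `naive_terminate` ignores the state entirely, while `io_terminate` switches exactly when doing so is predicted to be valuable, which is the behavior the IO convergence result relies on.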