Deep Reinforcement Learning Discovers Internal Models

Deep Reinforcement Learning (DRL) is a rapidly growing field of research, showing great promise on challenging problems such as playing Atari, mastering Go, and controlling robots. While DRL agents perform well in practice, we still lack the tools to analyze their behavior. In this work we present the Semi-Aggregated MDP (SAMDP), a model well suited to describing policies that exhibit both spatial and temporal hierarchies. We describe its advantages over other modeling approaches for analyzing trained policies, and show that under the right state representation, such as that of DQN agents, the SAMDP can help to identify skills. We detail the automatic process of building it from recorded trajectories, up to presenting it on t-SNE maps. We explain how to evaluate its fitness and show surprising results indicating high compatibility with the policy at hand. We conclude by showing how the SAMDP model can be used to squeeze an extra performance gain from the agent.
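To make the construction pipeline concrete, the following is a minimal, hypothetical sketch of building an SAMDP-style abstraction from recorded trajectories: embed the agent's hidden activations with t-SNE, aggregate them spatially via clustering, and count inter-cluster transitions as candidate skills. It assumes scikit-learn's TSNE and KMeans; the function name and all parameters are illustrative and not the authors' implementation.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def build_samdp_sketch(activations, n_clusters=10, random_state=0):
    """activations: (T, d) array of DQN last-hidden-layer features,
    recorded along trajectories in temporal order (illustrative only)."""
    # 1. Embed the visited states in 2D for aggregation and visualization.
    embedding = TSNE(n_components=2, random_state=random_state).fit_transform(activations)

    # 2. Spatial aggregation: cluster the embedded states into abstract states.
    labels = KMeans(n_clusters=n_clusters, random_state=random_state).fit_predict(embedding)

    # 3. Temporal aggregation: count transitions between abstract states;
    #    off-diagonal transitions correspond to candidate "skills".
    transitions = np.zeros((n_clusters, n_clusters))
    for src, dst in zip(labels[:-1], labels[1:]):
        transitions[src, dst] += 1

    # Row-normalize to obtain an empirical skill-transition matrix.
    row_sums = transitions.sum(axis=1, keepdims=True)
    transition_probs = np.divide(transitions, row_sums,
                                 out=np.zeros_like(transitions),
                                 where=row_sums > 0)
    return embedding, labels, transition_probs
```

The fitness of such a model could then be assessed, for example, by how well the cluster-level transition statistics predict the agent's actual trajectories.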
