Situational Awareness by Risk-Conscious Skills

Hierarchical Reinforcement Learning (HRL) has been shown to speed up the convergence of RL planning algorithms and to mitigate feature-based model misspecification (Mankowitz et al. 2016a,b; Bacon 2015). To do so, it uses hierarchical abstractions known as skills, a type of temporally extended action (Sutton et al. 1999), to plan at a higher level and abstract away lower-level details. We incorporate risk sensitivity, also referred to as Situational Awareness (SA), into hierarchical RL for the first time by defining and learning risk-aware skills in a Probabilistic Goal Semi-Markov Decision Process (PG-SMDP). This is achieved with our novel Situational Awareness by Risk-Conscious Skills (SARiCoS) algorithm, which comes with a theoretical convergence guarantee. In a RoboCup soccer domain, we show that the learned risk-aware skills exhibit complex human behaviors, such as 'time-wasting' in a soccer game. In addition, the learned risk-aware skills are able to mitigate reward-based model misspecification.
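To give a flavor of the probabilistic-goal criterion underlying a PG-SMDP, the toy sketch below (not the paper's SARiCoS algorithm; the two "skills" and their reward models are hypothetical) selects the skill that maximizes the probability of the return exceeding a target threshold, rather than the skill with the highest expected return:

```python
import random

def rollout_return(skill, rng):
    """Sample one episode's return under a skill (illustrative reward models)."""
    if skill == "safe":
        # Modest but reliable payoff.
        return rng.gauss(1.0, 0.1)
    else:  # "risky"
        # Higher mean, much higher variance.
        return rng.gauss(1.2, 2.0)

def prob_goal(skill, threshold, n=10_000, seed=0):
    """Monte Carlo estimate of P(return >= threshold) for a skill."""
    rng = random.Random(seed)
    hits = sum(rollout_return(skill, rng) >= threshold for _ in range(n))
    return hits / n

threshold = 0.8
p_safe = prob_goal("safe", threshold)
p_risky = prob_goal("risky", threshold)

# A probabilistic-goal (risk-conscious) agent prefers the skill with the
# higher probability of clearing the threshold, even though the "risky"
# skill has the higher expected return.
best = "safe" if p_safe > p_risky else "risky"
print(best, round(p_safe, 3), round(p_risky, 3))
```

Under these assumed reward models the "safe" skill clears the threshold far more reliably, so the probabilistic-goal criterion and the expected-return criterion disagree; this is the kind of risk-sensitive preference the abstract describes.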

[1] Stefan Schaal et al. Policy Gradient Methods for Robotics. 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.

[2] Bruno Castro da Silva et al. Learning Parameterized Skills. ICML, 2012.

[3] Stefan Schaal et al. Reinforcement learning of motor skills with policy gradients. Neural Networks (2008 Special Issue), 2008.

[4] Pravesh Ranchod et al. Reinforcement Learning with Parameterized Actions. AAAI, 2015.

[5] Xiaoping Chen et al. Online Planning for Large Markov Decision Processes with Hierarchical Decomposition. ACM Transactions on Intelligent Systems and Technology, 2015.

[6] Shie Mannor et al. Probabilistic Goal Markov Decision Processes. IJCAI, 2011.

[7] Hani Hagras et al. A hierarchical type-2 fuzzy logic control architecture for autonomous mobile robots. IEEE Transactions on Fuzzy Systems, 2004.

[8] Shie Mannor et al. Learning When to Switch between Skills in a High Dimensional Domain. AAAI Workshop: Learning for General Competency in Video Games, 2015.

[9] Tomoharu Nakashima et al. HELIOS Base: An Open Source Package for the RoboCup Soccer 2D Simulation. RoboCup, 2013.

[10] Lihong Li et al. PAC-inspired Option Discovery in Lifelong Reinforcement Learning. ICML, 2014.

[11] Shie Mannor et al. Time-regularized interrupting options. ICML, 2014.

[12] Peter Stone et al. Deep Reinforcement Learning in Parameterized Action Space. ICLR, 2015.

[13] Shie Mannor et al. Adaptive Skills Adaptive Partitions (ASAP). NIPS, 2016.

[14] A. Yiannakos et al. Evaluation of the goal scoring patterns in European Championship in Portugal 2004. 2006.

[15] Richard S. Sutton et al. Reinforcement Learning: An Introduction. IEEE Transactions on Neural Networks, 1998.

[16] Mica R. Endsley et al. Toward a Theory of Situation Awareness in Dynamic Systems. Human Factors, 1995.

[17] Doina Precup et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 1999.

[18] Shie Mannor et al. Policy Gradient for Coherent Risk Measures. NIPS, 2015.

[19] V. Borkar. Stochastic approximation with two time scales. 1997.

[20] Alex M. Andrew et al. Reinforcement Learning: An Introduction. 1998.

[21] Goldie Nejat et al. Multirobot Cooperative Learning for Semiautonomous Control in Urban Search and Rescue Applications. Journal of Field Robotics, 2016.

[22] Shane Legg et al. Human-level control through deep reinforcement learning. Nature, 2015.

[23] Andrew G. Barto et al. PolicyBlocks: An Algorithm for Creating Useful Macro-Actions in Reinforcement Learning. ICML, 2002.

[24] Doina Precup et al. Multi-time Models for Temporally Abstract Planning. NIPS, 1997.

[25] E. Fernandez-Gaucherand et al. Controlled Markov chains with exponential risk-sensitive criteria: modularity, structured policies and applications. Proceedings of the 37th IEEE Conference on Decision and Control, 1998.

[26] Shie Mannor et al. Iterative Hierarchical Optimization for Misspecified Problems (IHOMP). arXiv, 2016.

[27] Sebastian Thrun et al. Lifelong robot learning. Robotics and Autonomous Systems, 1993.

[28] Kip Smith et al. Situation Awareness Is Adaptive, Externally Directed Consciousness. Human Factors, 1995.

[29] Doina Precup et al. The Option-Critic Architecture. AAAI, 2016.

[30] Shie Mannor et al. Optimizing the CVaR via Sampling. AAAI, 2014.