Situational Awareness by Risk-Conscious Skills

Hierarchical Reinforcement Learning (HRL) has been shown to speed up the convergence of RL planning algorithms and to mitigate feature-based model misspecification (Mankowitz et al. 2016a,b; Bacon 2015). To do so, it uses hierarchical abstractions known as skills, a type of temporally extended action (Sutton et al. 1999), to plan at a higher level and abstract away lower-level details. We incorporate risk sensitivity, also referred to as Situational Awareness (SA), into hierarchical RL for the first time by defining and learning risk-aware skills in a Probabilistic Goal Semi-Markov Decision Process (PG-SMDP). This is achieved with our novel Situational Awareness by Risk-Conscious Skills (SARiCoS) algorithm, which comes with a theoretical convergence guarantee. In a RoboCup soccer domain, we show that the learned risk-aware skills exhibit complex human behaviors, such as 'time-wasting' in a soccer game. In addition, the learned risk-aware skills are able to mitigate reward-based model misspecification.
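To give a flavor of the probabilistic-goal criterion underlying a PG-SMDP, the toy sketch below (not the paper's SARiCoS algorithm; the two "skills" and their reward models are hypothetical) selects the skill that maximizes the probability of the return exceeding a target threshold, rather than the skill with the highest expected return:

```python
import random

def rollout_return(skill, rng):
    """Sample one episode's return under a skill (illustrative reward models)."""
    if skill == "safe":
        # Modest but reliable payoff.
        return rng.gauss(1.0, 0.1)
    else:  # "risky"
        # Higher mean, much higher variance.
        return rng.gauss(1.2, 2.0)

def prob_goal(skill, threshold, n=10_000, seed=0):
    """Monte Carlo estimate of P(return >= threshold) for a skill."""
    rng = random.Random(seed)
    hits = sum(rollout_return(skill, rng) >= threshold for _ in range(n))
    return hits / n

threshold = 0.8
p_safe = prob_goal("safe", threshold)
p_risky = prob_goal("risky", threshold)

# A probabilistic-goal (risk-conscious) agent prefers the skill with the
# higher probability of clearing the threshold, even though the "risky"
# skill has the higher expected return.
best = "safe" if p_safe > p_risky else "risky"
print(best, round(p_safe, 3), round(p_risky, 3))
```

Under these assumed reward models the "safe" skill clears the threshold far more reliably, so the probabilistic-goal criterion and the expected-return criterion disagree; this is the kind of risk-sensitive preference the abstract describes.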

[1] Stefan Schaal et al. Policy Gradient Methods for Robotics. 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.

[2] Bruno Castro da Silva et al. Learning Parameterized Skills. ICML, 2012.

[3] Stefan Schaal et al. Reinforcement learning of motor skills with policy gradients. Neural Networks (2008 Special Issue), 2008.

[4] Pravesh Ranchod et al. Reinforcement Learning with Parameterized Actions. AAAI, 2015.

[5] Xiaoping Chen et al. Online Planning for Large Markov Decision Processes with Hierarchical Decomposition. ACM Transactions on Intelligent Systems and Technology, 2015.

[6] Shie Mannor et al. Probabilistic Goal Markov Decision Processes. IJCAI, 2011.

[7] Hani Hagras et al. A hierarchical type-2 fuzzy logic control architecture for autonomous mobile robots. IEEE Transactions on Fuzzy Systems, 2004.

[8] Shie Mannor et al. Learning When to Switch between Skills in a High Dimensional Domain. AAAI Workshop: Learning for General Competency in Video Games, 2015.

[9] Tomoharu Nakashima et al. HELIOS Base: An Open Source Package for the RoboCup Soccer 2D Simulation. RoboCup, 2013.

[10] Lihong Li et al. PAC-inspired Option Discovery in Lifelong Reinforcement Learning. ICML, 2014.

[11] Shie Mannor et al. Time-regularized interrupting options. ICML, 2014.

[12] Peter Stone et al. Deep Reinforcement Learning in Parameterized Action Space. ICLR, 2015.

[13] Shie Mannor et al. Adaptive Skills Adaptive Partitions (ASAP). NIPS, 2016.

[14] A. Yiannakos et al. Evaluation of the goal scoring patterns in European Championship in Portugal 2004. 2006.

[15] Richard S. Sutton et al. Reinforcement Learning: An Introduction. IEEE Transactions on Neural Networks, 1998.

[16] Mica R. Endsley et al. Toward a Theory of Situation Awareness in Dynamic Systems. Human Factors, 1995.

[17] Doina Precup et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 1999.

[18] Shie Mannor et al. Policy Gradient for Coherent Risk Measures. NIPS, 2015.

[19] V. Borkar. Stochastic approximation with two time scales. 1997.

[20] Alex M. Andrew et al. Reinforcement Learning: An Introduction. 1998.

[21] Goldie Nejat et al. Multirobot Cooperative Learning for Semiautonomous Control in Urban Search and Rescue Applications. Journal of Field Robotics, 2016.

[22] Shane Legg et al. Human-level control through deep reinforcement learning. Nature, 2015.

[23] Andrew G. Barto et al. PolicyBlocks: An Algorithm for Creating Useful Macro-Actions in Reinforcement Learning. ICML, 2002.

[24] Doina Precup et al. Multi-time Models for Temporally Abstract Planning. NIPS, 1997.

[25] E. Fernandez-Gaucherand et al. Controlled Markov chains with exponential risk-sensitive criteria: modularity, structured policies and applications. Proceedings of the 37th IEEE Conference on Decision and Control, 1998.

[26] Shie Mannor et al. Iterative Hierarchical Optimization for Misspecified Problems (IHOMP). arXiv, 2016.

[27] Sebastian Thrun et al. Lifelong robot learning. Robotics and Autonomous Systems, 1993.

[28] Kip Smith et al. Situation Awareness Is Adaptive, Externally Directed Consciousness. Human Factors, 1995.

[29] Doina Precup et al. The Option-Critic Architecture. AAAI, 2016.

[30] Shie Mannor et al. Optimizing the CVaR via Sampling. AAAI, 2014.