Situationally Aware Options

Hierarchical abstractions, also known as options (Sutton et al., 1999), are temporally extended actions that enable a reinforcement learning agent to plan at a higher level, abstracting away low-level details. In this work, we learn reusable options whose parameters can vary, encouraging different behaviours depending on the current situation. In principle, these behaviours can include vigor, defence, or even risk-averseness, which are examples of what we refer to more broadly as Situational Awareness (SA). We incorporate SA, in the form of vigor, into hierarchical RL by defining and learning situationally aware options in a Probabilistic Goal Semi-Markov Decision Process (PG-SMDP). We do so with our Situationally Aware oPtions (SAP) policy gradient algorithm, which comes with a theoretical convergence guarantee. We learn reusable options in different scenarios in a RoboCup soccer domain (e.g., winning and losing). These options learn to execute with different levels of vigor, resulting in human-like behaviours such as `time-wasting' in the winning scenario. We also show that reusable options enable the agent to escape bad local optima in RoboCup. Finally, using SAP, the agent mitigates feature-based model misspecification in a Bottomless Pit of Death domain.
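
Only the abstract is available here, so the following is a minimal illustrative sketch of the core idea rather than the paper's actual SAP algorithm: an option exposes a continuous parameter (a scalar "vigor" level), the parameter's mean is a linear function of situation features (e.g., the score difference), and that mapping is trained with a plain REINFORCE-style policy gradient. The class name, the linear-Gaussian policy, and the toy return function are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class SituationallyAwareOption:
    """Hypothetical sketch: a Gaussian policy over one continuous option
    parameter ('vigor'), whose mean is a linear function of situation
    features. This is NOT the paper's actual SAP implementation."""

    def __init__(self, n_features, sigma=0.3, lr=0.05):
        self.w = np.zeros(n_features)  # weights mapping situation -> mean vigor
        self.sigma = sigma             # fixed exploration noise
        self.lr = lr                   # policy-gradient step size

    def sample(self, situation):
        mean = self.w @ situation
        return rng.normal(mean, self.sigma), mean

    def update(self, situation, vigor, mean, ret):
        # REINFORCE: gradient of log N(vigor; mean, sigma^2) w.r.t. w,
        # scaled by the observed return.
        grad_log_pi = (vigor - mean) / self.sigma ** 2 * situation
        self.w += self.lr * ret * grad_log_pi


def toy_return(situation, vigor):
    # Toy stand-in for the (PG-)SMDP return: when winning (score_diff > 0),
    # low vigor ('time-wasting') pays off; when losing, high vigor does.
    target = -1.0 if situation[0] > 0 else 1.0
    return -(vigor - target) ** 2


option = SituationallyAwareOption(n_features=2)
for _ in range(3000):
    score_diff = rng.choice([-1.0, 1.0])     # losing (-1) vs. winning (+1)
    situation = np.array([score_diff, 1.0])  # one feature plus a bias term
    vigor, mean = option.sample(situation)
    option.update(situation, vigor, mean, toy_return(situation, vigor))

print("learned mean vigor when winning:", option.w @ np.array([+1.0, 1.0]))
print("learned mean vigor when losing: ", option.w @ np.array([-1.0, 1.0]))
```

Run long enough, the learned mean vigor separates by situation: low when winning (the `time-wasting' behaviour) and high when losing. The paper's SAP algorithm additionally operates in a PG-SMDP and carries a convergence guarantee, neither of which this toy sketch attempts to capture.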

References

[1] Vivek S. Borkar et al. A sensitivity formula for risk-sensitive cost and the actor-critic algorithm. Systems & Control Letters, 2001.

[2] A. Yiannakos et al. Evaluation of the goal scoring patterns in European Championship in Portugal 2004. 2006.

[3] Scott Kuindersma et al. Variable risk control via stochastic optimization. International Journal of Robotics Research, 2013.

[4] Laurent El Ghaoui et al. Robust Control of Markov Decision Processes with Uncertain Transition Matrices. Operations Research, 2005.

[5] R. Dolan et al. Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature, 2006.

[6] Shane Legg et al. Human-level control through deep reinforcement learning. Nature, 2015.

[7] Robert Givan et al. Bounded Parameter Markov Decision Processes. ECP, 1997.

[8] Shie Mannor et al. Adaptive Skills Adaptive Partitions (ASAP). NIPS, 2016.

[9] Shie Mannor et al. Policy Gradient for Coherent Risk Measures. NIPS, 2015.

[10] George Konidaris et al. Value Function Approximation in Reinforcement Learning Using the Fourier Basis. AAAI, 2011.

[11] Bruno Castro da Silva et al. Learning Parameterized Skills. ICML, 2012.

[12] Hani Hagras et al. A hierarchical type-2 fuzzy logic control architecture for autonomous mobile robots. IEEE Transactions on Fuzzy Systems, 2004.

[13] P. Dayan et al. Tonic dopamine: opportunity costs and the control of response vigor. Psychopharmacology, 2007.

[14] Shie Mannor et al. Time-regularized interrupting options. ICML, 2014.

[15] J. Salamone et al. Activational and effort-related aspects of motivation: neural mechanisms and implications for psychopathology. Brain, 2016.

[16] V. Borkar. Stochastic approximation with two time scales. 1997.

[17] Doina Precup et al. The Option-Critic Architecture. AAAI, 2016.

[18] Shie Mannor et al. Optimizing the CVaR via Sampling. AAAI, 2014.

[19] Chelsea C. White et al. Markov Decision Processes with Imprecise Transition Probabilities. Operations Research, 1994.

[20] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[21] Shie Mannor et al. Iterative Hierarchical Optimization for Misspecified Problems (IHOMP). arXiv preprint, 2016.

[22] Shie Mannor et al. Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach. NIPS, 2015.

[23] Wyatt Newman. Team CASE and the 2007 DARPA Urban Challenge. 2007.

[24] A. Tversky et al. Prospect theory: An analysis of decision under risk. Econometrica, 1979.

[25] Doina Precup et al. Multi-time Models for Temporally Abstract Planning. NIPS, 1997.

[26] Goldie Nejat et al. Multirobot Cooperative Learning for Semiautonomous Control in Urban Search and Rescue Applications. Journal of Field Robotics, 2016.

[27] Xiaoping Chen et al. Online Planning for Large Markov Decision Processes with Hierarchical Decomposition. ACM Transactions on Intelligent Systems and Technology, 2015.

[28] E. Fernandez-Gaucherand et al. Controlled Markov chains with exponential risk-sensitive criteria: modularity, structured policies and applications. Proceedings of the 37th IEEE Conference on Decision and Control, 1998.

[29] Peter Stone et al. Deep Reinforcement Learning in Parameterized Action Space. ICLR, 2015.

[30] P. Dayan. Instrumental vigour in punishment and reward. European Journal of Neuroscience, 2012.

[31] Tomoharu Nakashima et al. HELIOS Base: An Open Source Package for the RoboCup Soccer 2D Simulation. RoboCup, 2013.

[32] Lihong Li et al. PAC-inspired Option Discovery in Lifelong Reinforcement Learning. ICML, 2014.

[33] P. Dayan et al. Safety out of control: dopamine and defence. Behavioral and Brain Functions, 2016.

[34] Shie Mannor et al. Probabilistic Goal Markov Decision Processes. IJCAI, 2011.

[35] Stefan Schaal et al. Policy Gradient Methods for Robotics. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006.

[36] Stefan Schaal et al. Reinforcement learning of motor skills with policy gradients. Neural Networks, 2008.

[37] A. Dagher et al. The role of dopamine in risk taking: a specific look at Parkinson's disease and gambling. Frontiers in Behavioral Neuroscience, 2014.

[38] E. Oleson et al. A role for phasic dopamine release within the nucleus accumbens in encoding aversion: a review of the neurochemical literature. ACS Chemical Neuroscience, 2015.

[39] Shie Mannor et al. Learning When to Switch between Skills in a High Dimensional Domain. AAAI Workshop on Learning for General Competency in Video Games, 2015.

[40] Doina Precup et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 1999.

[41] Shie Mannor et al. Policy Gradients with Variance Related Risk Criteria. ICML, 2012.

[42] Joel Z. Leibo et al. Multi-agent Reinforcement Learning in Sequential Social Dilemmas. AAMAS, 2017.