Guided Reinforcement Learning Under Partial Observability

Due to recent breakthroughs in artificial intelligence, reinforcement learning (RL) has gained considerable attention in recent years. In particular, combining RL with deep neural networks, which had already achieved great success in other research fields, has contributed to these breakthroughs. In many real-world scenarios, it is not possible to perceive the true and complete state of the environment. Such scenarios are known as learning under uncertainty or under partial observability. Formulating these problems as partially observable Markov decision processes (POMDPs) allows decision-making problems under uncertainty to be solved in a principled way. Guided RL approaches address this setting by supporting the RL algorithm with additional state information during the learning process to improve its performance on POMDPs. However, such guided approaches are relatively rare in the literature, and most existing ones are model-based, meaning they first require learning an appropriate model of the environment. For this reason, we propose a novel model-free guided RL approach, called guided reinforcement learning (GRL). The guidance is based on mixing samples containing full and partial state information, where the amount of full state information is gradually decreased during training, resulting in a policy that is compatible with partial observations. The general formulation of our simple GRL approach allows it to be combined with a variety of existing model-free RL algorithms and applied in a variety of settings. We demonstrate that our GRL approach can outperform baseline algorithms trained directly on partial observations on nine different tasks. These tasks include two instances of the well-known discrete-action-space problem RockSample and the continuous-action-space problem LunarLander-POMDP, a partially observable modification of the LunarLanderContinuous-v2 environment. In addition, the benchmark tasks include six partially observable tasks that we constructed from continuous control problems simulated in the MuJoCo physics simulator.
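The abstract describes the guidance mechanism only at a high level, so the following is a minimal sketch, written in Python against the classic OpenAI Gym API, of one way such observation mixing could be realized as an environment wrapper: with a probability that is annealed towards zero, the agent receives the full state, otherwise only the partially observable dimensions. The class name MixedObservationWrapper, the partial_indices argument, and the linear annealing schedule are illustrative assumptions and not taken from the paper.

import numpy as np
import gym


class MixedObservationWrapper(gym.Wrapper):
    """Reveal the full state with a probability that is annealed to zero,
    otherwise return only the partially observable state dimensions."""

    def __init__(self, env, total_steps, partial_indices):
        super().__init__(env)
        self.total_steps = total_steps          # length of the annealing schedule
        self.partial_indices = partial_indices  # dimensions the agent can always observe
        self.step_count = 0

    def _mask(self, state):
        # Linear schedule: the probability of a fully observed sample decays to zero.
        p_full = max(0.0, 1.0 - self.step_count / self.total_steps)
        if np.random.rand() < p_full:
            return state                        # guided sample with full state information
        partial = np.zeros_like(state)
        partial[self.partial_indices] = state[self.partial_indices]
        return partial                          # partial observation only

    def reset(self, **kwargs):
        # Classic Gym API assumed: reset() returns only the observation.
        return self._mask(self.env.reset(**kwargs))

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        self.step_count += 1
        return self._mask(state), reward, done, info

An agent from any off-the-shelf model-free algorithm (e.g. SAC or PPO) could then be trained directly on the wrapped environment, which is what makes such a scheme algorithm-agnostic; which dimensions remain observable via partial_indices would be chosen per task, and the choice here is again an assumption for illustration rather than the authors' setup.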
