Reinforcement Learning using Guided Observability

Reinforcement learning (RL) has recently demonstrated impressive performance in challenging sequential decision-making problems. An open question, however, is how to make RL cope with the partial observability that is prevalent in many real-world problems. In contrast to contemporary RL approaches, which mostly rely on improved memory representations or on strong assumptions about the type of partial observability, we propose a simple but efficient approach that can be combined with a wide variety of RL methods. Our main insight is that smoothly transitioning from full observability to partial observability during the training process yields a high-performance policy. The approach, called partially observable guided reinforcement learning (PO-GRL), makes it possible to utilize full state information during policy optimization without compromising the optimality of the final policy. A comprehensive evaluation on discrete partially observable Markov decision process (POMDP) benchmark problems and on continuous partially observable MuJoCo and OpenAI Gym tasks shows that PO-GRL improves performance. Finally, we demonstrate PO-GRL in the ball-in-the-cup task on a real Barrett WAM robot under partial observability.
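To make the central idea concrete, the following is a minimal sketch of one plausible instantiation, not the authors' implementation: a mixing coefficient beta is annealed from 1 (full state) to 0 (the true POMDP) over training, and the privileged state is blended into the agent's partial observation accordingly. The linear schedule, the assumption that state and observation are same-shape vectors (e.g., the observation is a masked state), and all names here are illustrative assumptions.

import numpy as np

def observability_coefficient(step: int, total_steps: int) -> float:
    # Linearly anneal beta from 1.0 (fully observable) to 0.0 (true POMDP).
    # Any monotone schedule ending at 0 preserves the final-policy setting.
    return max(0.0, 1.0 - step / float(total_steps))

def blended_observation(state: np.ndarray, observation: np.ndarray,
                        beta: float) -> np.ndarray:
    # Convex blend of the privileged full state and the partial observation.
    # Assumes both are vectors of the same shape (illustrative assumption).
    return beta * state + (1.0 - beta) * observation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    total_steps = 5
    state = rng.normal(size=4)            # privileged full state
    mask = np.array([1.0, 1.0, 0.0, 0.0])
    observation = mask * state            # partial view: last two dims hidden
    for step in range(total_steps + 1):
        beta = observability_coefficient(step, total_steps)
        # In an RL loop, the policy would be updated on this blended input;
        # by the end of training (beta = 0) it acts on the true observation only.
        print(step, round(beta, 2), blended_observation(state, observation, beta))

Because beta reaches 0 by the end of training, the policy is ultimately optimized on the genuine partial observations, which is why, as the abstract states, the privileged information does not compromise the optimality of the final policy.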
