Reinforcement Learning using Guided Observability

Reinforcement learning (RL) has recently demonstrated impressive performance in challenging sequential decision-making problems. An open question, however, is how to make RL cope with the partial observability that is prevalent in many real-world problems. In contrast to contemporary RL approaches, which mostly rely on improved memory representations or on strong assumptions about the type of partial observability, we propose a simple but efficient approach that can be combined with a wide variety of RL methods. Our main insight is that smoothly transitioning from full observability to partial observability during the training process yields a high-performance policy. The approach, called partially observable guided reinforcement learning (PO-GRL), makes it possible to utilize full state information during policy optimization without compromising the optimality of the final policy. A comprehensive evaluation on discrete partially observable Markov decision process (POMDP) benchmark problems and on continuous partially observable MuJoCo and OpenAI Gym tasks shows that PO-GRL improves performance. Finally, we demonstrate PO-GRL in the ball-in-the-cup task on a real Barrett WAM robot under partial observability.
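To make the central idea concrete, the following is a minimal sketch of one plausible instantiation, not the authors' implementation: a mixing coefficient beta is annealed from 1 (full state) to 0 (the true POMDP) over training, and the privileged state is blended into the agent's partial observation accordingly. The linear schedule, the assumption that state and observation are same-shape vectors (e.g., the observation is a masked state), and all names here are illustrative assumptions.

import numpy as np

def observability_coefficient(step: int, total_steps: int) -> float:
    # Linearly anneal beta from 1.0 (fully observable) to 0.0 (true POMDP).
    # Any monotone schedule ending at 0 preserves the final-policy setting.
    return max(0.0, 1.0 - step / float(total_steps))

def blended_observation(state: np.ndarray, observation: np.ndarray,
                        beta: float) -> np.ndarray:
    # Convex blend of the privileged full state and the partial observation.
    # Assumes both are vectors of the same shape (illustrative assumption).
    return beta * state + (1.0 - beta) * observation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    total_steps = 5
    state = rng.normal(size=4)            # privileged full state
    mask = np.array([1.0, 1.0, 0.0, 0.0])
    observation = mask * state            # partial view: last two dims hidden
    for step in range(total_steps + 1):
        beta = observability_coefficient(step, total_steps)
        # In an RL loop, the policy would be updated on this blended input;
        # by the end of training (beta = 0) it acts on the true observation only.
        print(step, round(beta, 2), blended_observation(state, observation, beta))

Because beta reaches 0 by the end of training, the policy is ultimately optimized on the genuine partial observations, which is why, as the abstract states, the privileged information does not compromise the optimality of the final policy.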
