Paper information - Hierarchical Reinforcement Learning: Approximating Optimal Discounted TSP Using Local Policies

In this work, we provide theoretical guarantees for reward decomposition in deterministic MDPs. Reward decomposition is a special case of Hierarchical Reinforcement Learning that allows one to learn many policies in parallel and combine them into a composite solution. Our approach maps this problem to a Reward Discounted Traveling Salesman Problem and then derives approximate solutions for it. In particular, we focus on approximate solutions that are local, i.e., solutions that only observe information about the current state. Local policies are easy to implement and do not require substantial computational resources because they do not perform planning. While local deterministic policies, such as Nearest Neighbor, are used in practice for hierarchical reinforcement learning, we propose three stochastic policies that guarantee better performance than any deterministic policy.
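To make the notion of a local deterministic policy concrete, the following is a minimal sketch of Nearest Neighbor on a toy discounted-reward TSP instance. It is an illustration only, not the paper's implementation: the function names, the Euclidean distances, the unit reward per node, the discount factor of 0.9, and the convention that no reward is collected at the start node are all assumptions made for this example.

```python
import math


def nearest_neighbor_tour(points, start=0):
    """Visit every point once, always moving to the closest unvisited point.

    The choice at each step depends only on the current location and the set
    of unvisited points, so no lookahead planning is performed.
    """
    unvisited = set(range(len(points))) - {start}
    tour = [start]
    current = start
    while unvisited:
        # Ties are broken by the lower index so the result is deterministic.
        nxt = min(unvisited, key=lambda j: (math.dist(points[current], points[j]), j))
        unvisited.remove(nxt)
        tour.append(nxt)
        current = nxt
    return tour


def discounted_return(points, tour, gamma=0.9, reward=1.0):
    """Sum the rewards, each discounted by gamma ** (distance travelled so far).

    Assumption for this sketch: the start node itself yields no reward.
    """
    total = 0.0
    elapsed = 0.0
    for prev, nxt in zip(tour, tour[1:]):
        elapsed += math.dist(points[prev], points[nxt])
        total += reward * gamma ** elapsed
    return total


# Hypothetical toy instance: four points in the plane, starting at index 0.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.0, 2.0)]
tour = nearest_neighbor_tour(points)
value = discounted_return(points, tour)
print(tour, value)
```

Because rewards collected later are discounted more heavily, the order of visits changes the total return, which is why the quality of such a greedy rule is compared against the optimal tour.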
