Inverse Policy Evaluation for Value-based Sequential Decision-making

Value-based methods for reinforcement learning lack generally applicable ways to derive behavior from a value function. Many approaches involve approximate value iteration (e.g., $Q$-learning) and act greedily with respect to the resulting estimates, injecting some degree of entropy to ensure that the state space is sufficiently explored. Behavior based on explicit greedification assumes that the values reflect those of \textit{some} policy, over which the greedy policy would be an improvement. However, value iteration can produce value functions that do not correspond to \textit{any} policy. This is especially relevant in the function-approximation regime, where the true value function cannot be represented exactly. In this work, we explore the use of \textit{inverse policy evaluation}, the process of solving for a likely policy given a value function, as a means of deriving behavior from a value function. We provide theoretical and empirical results showing that inverse policy evaluation, combined with an approximate value-iteration algorithm, is a feasible method for value-based control.
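As a rough formalization (our notation here, not necessarily the paper's exact objective): writing $T_\pi$ for the Bellman evaluation operator, inverse policy evaluation can be viewed as seeking a policy whose evaluation backup reproduces the given value estimate $v$, for instance by minimizing the residual
\[
(T_\pi v)(s) \doteq \sum_{a} \pi(a \mid s)\Big[r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, v(s')\Big], \qquad \pi_v \in \operatorname*{arg\,min}_{\pi} \big\lVert T_\pi v - v \big\rVert,
\]
where the minimization is over stochastic policies. If $v$ equals $v^{\pi}$ for some policy $\pi$, that policy attains zero residual; value iteration, by contrast, may produce a $v$ for which no policy does.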
