Inverse Policy Evaluation for Value-based Sequential Decision-making

Value-based methods for reinforcement learning lack generally applicable ways to derive behavior from a value function. Many approaches involve approximate value iteration (e.g., $Q$-learning) and act greedily with respect to the resulting estimates, injecting some degree of entropy to ensure that the state space is sufficiently explored. Behavior based on explicit greedification assumes that the values reflect those of \textit{some} policy, over which the greedy policy would be an improvement. However, value iteration can produce value functions that do not correspond to \textit{any} policy. This is especially relevant in the function-approximation regime, where the true value function cannot be represented exactly. In this work, we explore the use of \textit{inverse policy evaluation}, the process of solving for a likely policy given a value function, as a means of deriving behavior from a value function. We provide theoretical and empirical results showing that inverse policy evaluation, combined with an approximate value-iteration algorithm, is a feasible method for value-based control.
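As a rough formalization (our notation here, not necessarily the paper's exact objective): writing $T_\pi$ for the Bellman evaluation operator, inverse policy evaluation can be viewed as seeking a policy whose evaluation backup reproduces the given value estimate $v$, for instance by minimizing the residual
\[
(T_\pi v)(s) \doteq \sum_{a} \pi(a \mid s)\Big[r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, v(s')\Big], \qquad \pi_v \in \operatorname*{arg\,min}_{\pi} \big\lVert T_\pi v - v \big\rVert,
\]
where the minimization is over stochastic policies. If $v$ equals $v^{\pi}$ for some policy $\pi$, that policy attains zero residual; value iteration, by contrast, may produce a $v$ for which no policy does.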
