On Instrumental Variable Regression for Deep Offline Policy Evaluation

We show that the popular reinforcement learning (RL) strategy of estimating the state-action value function (Q-function) by minimizing the mean squared Bellman error leads to a regression problem with confounding: the inputs and the output noise are correlated. Hence, direct minimization of the Bellman error can result in significantly biased Q-function estimates. We explain why fixing the target Q-network in Deep Q-Networks and Fitted Q Evaluation provides a way of overcoming this confounding, thus shedding new light on this popular but not well-understood trick in the deep RL literature. An alternative approach to addressing confounding is to leverage techniques developed in the causality literature, notably instrumental variables (IV). Here we bring together the IV and RL literatures by investigating whether IV approaches can lead to improved Q-function estimates. This paper analyzes and compares a wide range of recent IV methods in the context of offline policy evaluation (OPE), where the goal is to estimate the value of a policy using logged data only. By applying different IV techniques to OPE, we are not only able to recover previously proposed OPE methods, such as model-based techniques, but also to obtain competitive new techniques. We find empirically that state-of-the-art OPE methods are closely matched in performance by some IV methods, such as AGMM, that were not developed for OPE.
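To make the confounding argument concrete, the sketch below restates it in standard OPE notation. The symbols y, the Bellman operator T^pi, and the variance decomposition are our own illustration under the usual Markov-decision-process assumptions, not equations reproduced from the paper.

% Sketch of the confounded Bellman-error regression and its IV reformulation
% (standard OPE notation; illustrative, not verbatim from the paper).
\begin{align*}
&\text{Logged transition } (s, a, r, s'),\ \text{target policy } \pi,\ \text{Bellman target } y = r + \gamma\, Q(s', \pi(s')),\\
&\mathcal{T}^{\pi} Q(s,a) := \mathbb{E}\big[\, r + \gamma\, Q(s', \pi(s')) \mid s, a \,\big],
 \qquad Q^{\pi} \text{ is the fixed point } Q^{\pi} = \mathcal{T}^{\pi} Q^{\pi}.\\[4pt]
&\text{The mean squared Bellman error decomposes as}\\
&\mathbb{E}\big[(y - Q(s,a))^{2}\big]
  \;=\; \underbrace{\mathbb{E}\big[(\mathcal{T}^{\pi}Q(s,a) - Q(s,a))^{2}\big]}_{\text{Bellman error}}
  \;+\; \underbrace{\mathbb{E}\big[\operatorname{Var}\!\big(y \mid s,a\big)\big]}_{\text{depends on } Q \text{ via } Q(s',\pi(s'))}.\\[4pt]
&\text{The second term depends on } Q \text{ in stochastic environments, so joint minimization is biased;}\\
&\text{fixing a target network } \bar{Q} \text{ inside } y \text{ removes } Q \text{ from that term, leaving an ordinary regression.}\\[4pt]
&\text{IV view: the Bellman residual has zero conditional mean, so } (s,a) \text{ acts as an instrument:}\\
&\mathbb{E}\big[\, r + \gamma\, Q(s', \pi(s')) - Q(s,a) \;\big|\; s, a \,\big] = 0 .
\end{align*}

Conditional-moment estimators such as GMM or AGMM enforce this restriction against a class of test functions of (s, a), which is how IV techniques of this kind can be applied to OPE.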
