Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime $d\gamma^2 > 1$, where $d$ is the dimension of the feature vector and $\gamma$ is the discount rate. In this regime, for any $q \in [\gamma^2, 1]$, we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is $q/d$ and it requires $\Omega\left(\frac{d}{\gamma^2(q-\gamma^2)\varepsilon^2}\exp\big(\Theta(d\gamma^2)\big)\right)$ samples to approximate the value function up to an additive error $\varepsilon$. Note that this sample-complexity lower bound is exponential in $d$. If $q = \gamma^2$, even infinitely many samples do not suffice. Under a low-distribution-shift assumption, we show that there is an algorithm that needs at most $O\left(\max\left\{\frac{\|\theta^\pi\|_2^4}{\varepsilon^4}\log\frac{d}{\delta},\ \frac{1}{\varepsilon^2}\left(d+\log\frac{1}{\delta}\right)\right\}\right)$ samples ($\theta^\pi$ is the coefficient vector of the target policy's value function under linear function approximation) and guarantees approximation of the value function up to an additive error of $\varepsilon$ with probability at least $1-\delta$.
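
For concreteness, below is a minimal sketch of offline policy evaluation with linear function approximation, assuming an LSTD-style least-squares solve of the empirical Bellman equation from batch transitions. The function name `lstd_policy_evaluation` and the ridge term `reg` are illustrative assumptions, not the exact estimator analyzed in the paper.

```python
import numpy as np

def lstd_policy_evaluation(phi_s, phi_s_next, rewards, gamma, reg=1e-6):
    """Estimate the linear value-function parameter theta^pi from offline data.

    phi_s      : (n, d) feature vectors of visited states s_i
    phi_s_next : (n, d) feature vectors of successor states s_i'
                 (transitions assumed generated under the target policy pi)
    rewards    : (n,) observed rewards r_i
    gamma      : discount factor in (0, 1)
    reg        : small ridge term for numerical stability (illustrative choice)
    """
    n, d = phi_s.shape
    # Empirical LSTD matrix A = (1/n) * sum_i phi(s_i) (phi(s_i) - gamma * phi(s_i'))^T
    A = phi_s.T @ (phi_s - gamma * phi_s_next) / n + reg * np.eye(d)
    # Empirical vector b = (1/n) * sum_i phi(s_i) * r_i
    b = phi_s.T @ rewards / n
    # Solve A theta = b; the estimated value function is V_hat(s) = phi(s)^T theta
    theta = np.linalg.solve(A, b)
    return theta
```

Under the low-distribution-shift assumption, the offline data distribution is close enough to the target policy's distribution for such a least-squares system to be well conditioned; in the hard regime $d\gamma^2 > 1$, the lower bound shows that no estimator can avoid a sample complexity exponential in $d$.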
