POLITEX: Regret Bounds for Policy Iteration using Expert Prediction

We present POLITEX (POLicy ITeration with EXpert advice), a variant of policy iteration in which each policy is a Boltzmann distribution over the sum of the action-value function estimates of all previous policies, and analyze its regret in continuing RL problems. We assume that the value function error after running a policy for τ time steps scales as ε(τ) = ε₀ + Õ(√(d/τ)), where ε₀ is the worst-case approximation error and d is the number of features in a compressed representation of the state-action space. We establish that this condition is satisfied by the LSPE algorithm under certain assumptions on the MDP and the policies. Under this error assumption, we show that the regret of POLITEX in uniformly mixing MDPs scales as Õ(dT^{3/4} + ε₀T), where Õ(·) hides logarithmic terms and problem-dependent constants. We thus provide the first regret bound for a fully practical model-free method that scales only with the number of features, and not with the size of the underlying MDP. Experiments on a queuing problem confirm that POLITEX is competitive with some of its alternatives, while preliminary results on Ms Pacman (one of the standard Atari benchmark problems) confirm the viability of POLITEX beyond linear function approximation.
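
To make the policy construction concrete, below is a minimal sketch of the update described in the abstract: the next policy is a Boltzmann (softmax) distribution over the sum of the action-value estimates of the previous phases. The tabular array shapes, the function name politex_policy, and the inverse-temperature parameter eta are illustrative assumptions, not part of the paper; POLITEX itself is stated for general value-function estimators (e.g., linear features), where the same softmax is applied to the summed Q estimates at a given state.

```python
import numpy as np

def politex_policy(q_estimates, eta):
    """Boltzmann (softmax) policy over the sum of past action-value estimates.

    q_estimates: list of arrays, each of shape (num_states, num_actions),
                 one estimated Q-function per previous phase.
    eta:         inverse-temperature parameter (assumed, for illustration).
    Returns pi of shape (num_states, num_actions) with
    pi[s, a] proportional to exp(eta * sum_i Q_i(s, a)).
    """
    q_sum = np.sum(q_estimates, axis=0)           # sum Q estimates over phases
    logits = eta * q_sum
    logits -= logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)           # normalize over actions
    return pi

# Illustrative usage with random Q estimates (hypothetical problem sizes).
rng = np.random.default_rng(0)
q_estimates = [rng.normal(size=(5, 3)) for _ in range(4)]  # 4 phases, 5 states, 3 actions
pi = politex_policy(q_estimates, eta=0.1)
```

In each phase, the agent runs the current softmax policy for τ steps, forms a new action-value estimate (e.g., via LSPE), appends it to the running sum, and recomputes the policy as above.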
