Large-Scale Markov Decision Problems via the Linear Programming Dual

We consider the problem of controlling a fully specified Markov decision process (MDP), also known as the planning problem, when the state space is very large and computing the optimal policy is intractable. Instead, we pursue the more modest goal of optimizing over a small family of policies. Specifically, we show that the family of policies associated with a low-dimensional approximation of occupancy measures yields a tractable optimization problem. Moreover, we propose an efficient algorithm, whose runtime scales with the dimension of the subspace rather than the size of the state space, that finds a policy with low excess loss relative to the best policy in this class. To the best of our knowledge, no such results existed previously in the literature. We bound the excess loss in both the average-cost and discounted-cost settings, which we treat separately. Preliminary experiments demonstrate the effectiveness of the proposed algorithms in a queueing application.
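To make the occupancy-measure viewpoint concrete, below is a minimal sketch of the classical dual linear program for an average-cost MDP (going back to Manne, 1960) together with the standard rule for reading a stationary policy off an occupancy measure. This sketch solves the exact LP with a generic solver and therefore scales with the full state-action space; the approach described above instead restricts the occupancy measure to a low-dimensional subspace and optimizes the subspace coefficients. All problem data here (`P`, `c`, the problem sizes) are illustrative assumptions, not quantities from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative small average-cost MDP (all quantities are made up).
S, A = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, s'] = transition probability
c = rng.uniform(size=(S, A))                # c[s, a]     = per-step cost

# Dual LP over occupancy measures mu(s, a):
#   minimize    sum_{s,a} mu(s, a) c(s, a)
#   subject to  sum_a mu(s', a) = sum_{s,a} mu(s, a) P(s' | s, a)   for all s'
#               sum_{s,a} mu(s, a) = 1,  mu >= 0
n = S * A
A_eq = np.zeros((S + 1, n))
for s_next in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[s_next, s * A + a] += P[s, a, s_next]  # inflow into s'
    for a in range(A):
        A_eq[s_next, s_next * A + a] -= 1.0             # outflow from s'
A_eq[S, :] = 1.0                                        # normalization row
b_eq = np.zeros(S + 1)
b_eq[S] = 1.0

res = linprog(c.reshape(n), A_eq=A_eq, b_eq=b_eq,
              bounds=(0, None), method="highs")
mu = res.x.reshape(S, A)

# A stationary policy is recovered via pi(a | s) = mu(s, a) / sum_a mu(s, a).
pi = mu / np.maximum(mu.sum(axis=1, keepdims=True), 1e-12)
print("optimal average cost:", res.fun)
print("policy:\n", np.round(pi, 3))
```

Roughly speaking, replacing `mu` with `Phi @ theta` for a feature matrix `Phi` of shape `(S * A, d)` turns this into a `d`-dimensional problem, but the flow constraints then hold only approximately; controlling the effect of these constraint violations is what the excess-loss bounds address.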
