Large-Scale Markov Decision Problems with KL Control Cost and Its Application to Crowdsourcing

We study average- and total-cost Markov decision problems with large state spaces. Since the computational and statistical cost of finding the optimal policy scales with the size of the state space, we instead search for a near-optimal policy within a low-dimensional family of policies. In particular, we show that for problems with a Kullback-Leibler divergence control cost, policy optimization can be recast as a convex optimization problem and solved approximately by a stochastic subgradient algorithm whose complexity scales with the dimension of the policy family rather than the size of the state space. We show that the performance of the resulting policy is close to that of the best policy in the family. We demonstrate the efficacy of our approach by optimizing a budget-allocation policy for crowd labeling, an important crowdsourcing application.
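For context, here is a minimal sketch of the kind of KL control cost the abstract refers to, in the spirit of linearly solvable MDPs; the notation (state cost \ell, passive dynamics p, controlled transition distribution u) is assumed for illustration, not taken from the paper:

```latex
c(x, u) \;=\; \ell(x) + D_{\mathrm{KL}}\bigl(u(\cdot \mid x) \,\big\|\, p(\cdot \mid x)\bigr)
        \;=\; \ell(x) + \sum_{x'} u(x' \mid x)\,\log \frac{u(x' \mid x)}{p(x' \mid x)}.
```

Roughly speaking, under this cost structure the substitution z(x) = exp(-v(x)) turns the Bellman equation into a linear equation in z, which is what opens the door to convex reformulations of policy optimization.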
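Similarly, here is a minimal sketch of a projected stochastic subgradient method of the kind the abstract describes. Everything in it is an illustrative assumption rather than the paper's implementation: sample_subgradient stands in for whatever unbiased subgradient oracle the problem provides (in the setting above it would be built from sampled states, so the per-step cost is independent of the size of the state space), and the projection radius and step-size schedule are placeholders.

```python
import numpy as np

def stochastic_subgradient(sample_subgradient, theta0, radius, n_steps, step0):
    """Averaged projected stochastic subgradient descent.

    sample_subgradient(theta) -- returns an unbiased estimate of a
        subgradient of the convex objective at theta
    radius -- l2 ball onto which iterates are projected
    step0  -- initial step size; steps decay as step0 / sqrt(t)

    Returns the running average of the iterates, which for a convex
    Lipschitz objective is O(1/sqrt(n_steps))-suboptimal in expectation.
    """
    theta = np.array(theta0, dtype=float)
    avg = np.zeros_like(theta)
    for t in range(1, n_steps + 1):
        g = sample_subgradient(theta)
        theta = theta - (step0 / np.sqrt(t)) * g   # diminishing step size
        norm = np.linalg.norm(theta)
        if norm > radius:                          # project back onto the ball
            theta *= radius / norm
        avg += (theta - avg) / t                   # running average of iterates
    return avg

# Toy usage: noisy subgradients of f(theta) = ||theta - target||_1,
# where `target` is a made-up optimum for this example.
rng = np.random.default_rng(0)
target = np.array([0.5, -0.3])
noisy_subgrad = lambda th: np.sign(th - target) + 0.1 * rng.standard_normal(2)
theta_hat = stochastic_subgradient(noisy_subgrad, np.zeros(2),
                                   radius=1.0, n_steps=5000, step0=0.5)
```

Averaging the iterates, rather than returning the last one, is what gives the standard O(1/sqrt(T)) guarantee for non-smooth convex objectives.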
