Policy-gradient for robust planning

Real-world Decision-Theoretic Planning (DTP) is a very challenging research field. A common approach is to model such problems as Markov Decision Problems (MDPs) and solve them with dynamic programming techniques. Yet two major difficulties arise: (1) dynamic programming does not scale with the number of tasks, and (2) the probabilistic model may be uncertain, leading to the selection of unsafe policies. We build on Policy Gradient algorithms to address the first difficulty and on robust decision-making to address the second, through algorithms that train two competing learning agents. The first agent learns the plan while the second learns the model most likely to upset that plan. It is known from gradient-based game theory that at least one of the players may fail to converge, so we focus on the convergence of the robust plan only, using non-symmetric algorithms.
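To make the two-agent setup concrete, here is a minimal sketch, not the authors' algorithm, of two competing policy-gradient learners on a toy chain MDP. It assumes a REINFORCE-style update for the planner's softmax policy, an adversary that samples one transition model per episode from a small discrete uncertainty set (SLIP_SET) and is updated to minimise the planner's return, and illustrative non-symmetric step sizes (ALPHA_PLAN, ALPHA_ADV); all names and constants are hypothetical.

```python
# Minimal sketch of adversarial policy-gradient planning (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

N_STATES, GOAL, HORIZON = 5, 4, 20
ACTIONS = (-1, +1)                      # move left / move right
SLIP_SET = np.array([0.0, 0.2, 0.4])    # uncertainty set over slip probability

theta = np.zeros((N_STATES, len(ACTIONS)))   # planner: softmax policy params
phi = np.zeros(len(SLIP_SET))                # adversary: softmax over models
ALPHA_PLAN, ALPHA_ADV = 0.1, 0.02            # non-symmetric learning rates

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def episode(theta, slip):
    """Roll out one episode under the given slip probability."""
    s, traj, ret = 0, [], 0.0
    for _ in range(HORIZON):
        probs = softmax(theta[s])
        a = rng.choice(len(ACTIONS), p=probs)
        move = -ACTIONS[a] if rng.random() < slip else ACTIONS[a]
        s2 = int(np.clip(s + move, 0, N_STATES - 1))
        ret += 1.0 if s2 == GOAL else 0.0
        traj.append((s, a))
        s = s2
        if s == GOAL:
            break
    return traj, ret

for it in range(2000):
    # Adversary samples the transition model it currently believes is worst.
    model_probs = softmax(phi)
    m = rng.choice(len(SLIP_SET), p=model_probs)
    traj, ret = episode(theta, SLIP_SET[m])

    # Planner: REINFORCE ascent on its own return.
    for s, a in traj:
        grad = -softmax(theta[s])
        grad[a] += 1.0
        theta[s] += ALPHA_PLAN * ret * grad

    # Adversary: REINFORCE on the *negative* return (it wants to upset the plan).
    grad_phi = -model_probs
    grad_phi[m] += 1.0
    phi += ALPHA_ADV * (-ret) * grad_phi

print("adversary's model weights:", np.round(softmax(phi), 2))
print("greedy action per state:  ", theta.argmax(axis=1))
```

The asymmetry in step sizes is one simple way to reflect the paper's focus on convergence of the robust plan rather than of both players; the adversary here merely tracks the currently most damaging model.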
