A two-teams approach for robust probabilistic temporal planning

Solving large, real-world Probabilistic Temporal Planning (PTP) problems is very challenging. A common approach is to model such problems as Markov Decision Problems (MDPs) and apply dynamic programming techniques. Yet two major difficulties arise: (1) dynamic programming does not scale with the number of tasks, and (2) the probabilistic model may be uncertain, leading to the choice of unsafe policies. We build here on the Factored Policy Gradient (FPG) algorithm and on robust decision-making to address both difficulties with an algorithm that trains two competing teams of learning agents. Because the teams learn simultaneously, each agent faces a non-stationary environment; the goal is for them to find a common Nash equilibrium.
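
As a rough illustration of the two-team idea, the sketch below pits two independent policy-gradient (REINFORCE-style) learners against each other on a toy zero-sum game standing in for the planner-versus-uncertain-model interaction: one team maximizes the return while the other minimizes it, so each faces a non-stationary opponent and the pair is driven toward a mixed Nash equilibrium. The game, learning rates, and variable names are illustrative assumptions, not the paper's actual FPG setup.

```python
import numpy as np

# Toy zero-sum game standing in for the planner-vs-adversarial-model setting:
# the "planner" team picks a row, the "adversary" team picks a column, and the
# planner receives payoff[row, col] (the adversary receives its negation).
# These matching-pennies payoffs have a mixed Nash equilibrium at (0.5, 0.5).
payoff = np.array([[+1.0, -1.0],
                   [-1.0, +1.0]])

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

rng = np.random.default_rng(0)
theta_planner = np.zeros(2)    # policy parameters of the planning team
theta_adversary = np.zeros(2)  # policy parameters of the adversarial team
lr = 0.05
avg_p = np.zeros(2)            # time-averaged policies (these converge even
avg_q = np.zeros(2)            # when the instantaneous policies cycle)
steps = 20000

for step in range(steps):
    p = softmax(theta_planner)
    q = softmax(theta_adversary)
    avg_p += p
    avg_q += q

    # Both teams sample actions simultaneously; from each team's point of
    # view the other team is just part of a non-stationary environment.
    a = rng.choice(2, p=p)
    b = rng.choice(2, p=q)
    r = payoff[a, b]

    # Likelihood-ratio (REINFORCE) gradient estimates: d log pi(a) / d theta.
    grad_p = np.eye(2)[a] - p
    grad_q = np.eye(2)[b] - q

    theta_planner += lr * r * grad_p    # planner ascends on the return
    theta_adversary -= lr * r * grad_q  # adversary descends on the return

print("planner (avg policy)  :", avg_p / steps)
print("adversary (avg policy):", avg_q / steps)
# Both averages drift toward the (0.5, 0.5) mixed equilibrium, although the
# instantaneous policies of simultaneous gradient play may orbit around it.
```

The actual FPG-based planner replaces this matrix game with a simulated temporal planning domain, where one team of agents decides which tasks to start and the other perturbs the uncertain transition model within its admissible bounds.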
