The factored policy-gradient planner

We present an anytime concurrent probabilistic temporal planner (CPTP) that handles continuous and discrete uncertainties as well as metric functions. Rather than relying on dynamic programming, our approach builds on methods from stochastic local policy search: we optimise a parameterised policy using gradient ascent. The flexibility of this policy-gradient approach, combined with its low memory use, function approximation, and factorisation of the policy, allows us to tackle complex domains. The resulting factored policy-gradient (FPG) planner can optimise the number of steps to the goal, the probability of reaching the goal, or a combination of both. We compare the FPG planner with other planners on CPTP domains, and on simpler but better-studied non-concurrent, non-temporal probabilistic planning (PP) domains. We also present FPG-ipc, the PP version of the planner, which performed successfully in the probabilistic track of the fifth International Planning Competition.
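To make the "factored policy optimised by gradient ascent" idea concrete, the sketch below shows a minimal policy with one independent binary softmax per action (start the action vs. do nothing), linear in an observation feature vector, updated online with an OLPOMDP-style eligibility-trace gradient estimator. This is an illustrative assumption-laden sketch, not the authors' implementation: the environment interface (`reset`/`step`), the feature and action counts, and the hyperparameters are hypothetical placeholders, and the exact FPG gradient estimator differs in detail.

```python
import numpy as np

class FactoredSoftmaxPolicy:
    """One independent binary softmax per action: {start action, do nothing}.

    Parameters are linear in the observation features, so memory grows with
    (n_actions x 2 x n_features) rather than with the state space.
    """

    def __init__(self, n_actions, n_features, rng):
        self.theta = np.zeros((n_actions, 2, n_features))  # per-action weights
        self.rng = rng

    def probs(self, obs):
        logits = self.theta @ obs                      # (n_actions, 2)
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)

    def sample(self, obs):
        """Sample a choice per action and return grad log pi(choices | obs)."""
        p = self.probs(obs)
        choices = np.array([self.rng.choice(2, p=pa) for pa in p])
        grad = np.zeros_like(self.theta)
        for a, c in enumerate(choices):
            g = -p[a]                                  # softmax log-gradient:
            g[c] += 1.0                                # (indicator - prob)
            grad[a] = np.outer(g, obs)
        return choices, grad

def olpomdp(env, policy, steps=100_000, alpha=1e-2, beta=0.95):
    """Online policy-gradient ascent with a discounted eligibility trace.

    `env` is a hypothetical simulator with reset() -> obs and
    step(choices) -> (obs, reward, done); rewards encode the chosen
    objective (e.g. goal reached, or negative step cost).
    """
    trace = np.zeros_like(policy.theta)
    obs = env.reset()
    for _ in range(steps):
        choices, grad = policy.sample(obs)
        obs, reward, done = env.step(choices)
        trace = beta * trace + grad                    # eligibility trace
        policy.theta += alpha * reward * trace         # gradient-ascent step
        if done:
            obs = env.reset()
            trace[:] = 0.0
    return policy
```

Because each action's softmax only sees the shared feature vector and its own weights, the policy factorises across concurrent actions; the single scalar reward signal is what couples their updates.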
