FF + FPG: Guiding a Policy-Gradient Planner

The Factored Policy-Gradient planner (FPG) (Buffet & Aberdeen 2006) was a successful competitor in the probabilistic track of the 2006 International Planning Competition (IPC). FPG is innovative in that it scales to large planning domains by using Reinforcement Learning: it essentially performs a stochastic local search in policy space. FPG's weakness is its potentially long learning time, as it initially acts randomly and improves its policy only each time the goal is reached. This paper shows how to use an external teacher to guide FPG's exploration. While any teacher can be used, we concentrate on the actions suggested by FF's heuristic (Hoffmann 2001), since FF-Replan has proved efficient at probabilistic replanning. To achieve this, FPG must learn its own policy while following another one. We therefore extend FPG to off-policy learning using importance sampling (Glynn & Iglehart 1989; Peshkin & Shelton 2002). The resulting algorithm is presented and evaluated on IPC benchmarks.
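For concreteness, the sketch below illustrates the kind of importance-sampled policy-gradient update this requires: trajectories are generated by a fixed teacher policy mu, and the REINFORCE-style gradient of the learner's policy pi_theta is reweighted by the trajectory likelihood ratio prod_t pi_theta(a_t|s_t) / mu(a_t|s_t), as in (Glynn & Iglehart 1989; Peshkin & Shelton 2002). The toy chain MDP, the tabular softmax policy, and the hard-coded teacher are illustrative assumptions standing in for FPG's factored policy and FF's suggested actions, not the paper's actual implementation.

```python
import numpy as np

# Minimal sketch of off-policy REINFORCE with importance sampling:
# actions are drawn from a fixed "teacher" policy mu, while the gradient
# of the learner's softmax policy pi_theta is reweighted by the
# trajectory likelihood ratio prod_t pi_theta(a_t|s_t) / mu(a_t|s_t).
# The chain MDP, the tabular policy and the hard-coded teacher are
# illustrative assumptions, not FPG's factored policy or FF's heuristic.

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GOAL, ALPHA = 5, 2, 4, 0.05
theta = np.zeros((N_STATES, N_ACTIONS))    # learner's policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def teacher_probs(s):
    # Hypothetical teacher biased toward the goal-directed action,
    # standing in for the actions suggested by FF's heuristic.
    return np.array([0.2, 0.8])

def step(s, a):
    # Action 1 moves right with probability 0.9, otherwise left.
    s_next = min(s + 1, GOAL) if a == 1 and rng.random() < 0.9 else max(s - 1, 0)
    return s_next, float(s_next == GOAL)   # reward 1 only at the goal

for episode in range(2000):
    s, log_ratio, ret, grads = 0, 0.0, 0.0, []
    for t in range(50):
        p_pi, p_mu = softmax(theta[s]), teacher_probs(s)
        a = rng.choice(N_ACTIONS, p=p_mu)  # act with the teacher, not pi_theta
        log_ratio += np.log(p_pi[a]) - np.log(p_mu[a])
        g = -p_pi                          # d log pi(a|s) / d theta[s]
        g[a] += 1.0                        # (softmax score function)
        grads.append((s, g))
        s, r = step(s, a)
        ret += r
        if r > 0:                          # episodic task: stop at the goal
            break
    if ret == 0.0:                         # as in FPG, only goal-reaching
        continue                           # trajectories update the policy
    w = np.exp(log_ratio)                  # trajectory importance weight
    for s_t, g in grads:
        theta[s_t] += ALPHA * w * ret * g  # importance-weighted update
```

Because the teacher reaches the goal far more often than a uniformly random initial policy would, the learner receives informative goal-reaching trajectories early on, while the importance weight keeps the gradient estimate valid for the learner's own policy (assuming the teacher assigns nonzero probability to every action).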

[1]  J. Hammersley. Simulation and the Monte Carlo Method, 1982.

[2]  Peter W. Glynn, Donald L. Iglehart. Importance sampling for stochastic simulations. Management Science, 1989.

[3]  R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 1992.

[4]  Jörg Hoffmann. FF: The Fast-Forward Planning System. AI Magazine, 2001.

[5]  Nicolas Meuleau, et al. Exploration in Gradient-Based Reinforcement Learning, 2001.

[6]  Jörg Hoffmann, Bernhard Nebel. The FF Planning System: Fast Plan Generation Through Heuristic Search. Journal of Artificial Intelligence Research, 2001.

[7]  Peter L. Bartlett, et al. Experiments with Infinite-Horizon, Policy-Gradient Estimation. Journal of Artificial Intelligence Research, 2001.

[8]  Christian R. Shelton. Importance sampling for reinforcement learning with multiple objectives, 2001.

[9]  Leonid Peshkin, Christian R. Shelton. Learning from Scarce Experience. ICML, 2002.

[10]  Robert Givan, et al. Approximate Policy Iteration with a Policy Language Bias. NIPS, 2003.

[11]  Håkan L. S. Younes, et al. The First Probabilistic Track of the International Planning Competition. Journal of Artificial Intelligence Research, 2005.

[12]  Craig Boutilier, et al. Probabilistic Planning via Linear Value-approximation of First-order MDPs, 2005.

[13]  Iain Little. Paragraph: A Graphplan-based Probabilistic Planner, 2006.

[14]  Florent Teichteil-Königsbuch, Patrick Fabiani. Symbolic Stochastic Focused Dynamic Programming with Decision Diagrams, 2006.

[15]  O. Buffet, D. Aberdeen. The Factored Policy-Gradient Planner (IPC-06 Version), 2006.

[16]  Alan Fern, et al. Discriminative Learning of Beam-Search Heuristics for Planning. IJCAI, 2007.

[17]  Robert Givan, et al. FF-Replan: A Baseline for Probabilistic Planning. ICAPS, 2007.

[18]  Piergiorgio Bertoli, et al. A Hybridized Planner for Stochastic Domains. IJCAI, 2007.

[19]  Olivier Buffet, et al. Concurrent Probabilistic Temporal Planning with Policy-Gradients. ICAPS, 2007.