Probabilistic Planning via Linear Value-approximation of First-order MDPs

We describe a probabilistic planning approach that translates a PPDDL planning problem description into a first-order MDP (FOMDP) and uses approximate solution techniques for FOMDPs to derive a value function and corresponding policy. Our FOMDP solution techniques represent the value function linearly w.r.t. a set of first-order basis functions and compute suitable weights using lifted, first-order extensions of approximate linear programming (FOALP) and approximate policy iteration (FOAPI) for MDPs. We additionally describe techniques for automatic basis function generation and for the decomposition of universal rewards, both of which are crucial for achieving autonomous and tractable FOMDP solutions in many planning domains.

From PPDDL to First-order MDPs

It is straightforward to translate a PPDDL [12] planning domain into the situation calculus representation used for first-order MDPs (FOMDPs); the primary part of this translation is the conversion of PPDDL action schemata into effect axioms in the situation calculus, which are then compiled into the successor-state axioms [8] used in the FOMDP description. In the following algorithm description, we assume that we are given a FOMDP specification, and we describe techniques for approximating its value function linearly w.r.t. a set of first-order basis functions. From this value function it is straightforward to derive a first-order policy representation that can be used for action selection in the original PPDDL planning domain.

Linear Value Approximation for FOMDPs

The following explanation assumes that the reader is familiar with the FOMDP formalism and operators used by Boutilier, Reiter, and Price [2] and extended by Sanner and Boutilier [9]. In the following text, we refer to function symbols A_i(~x) that correspond to parameterized actions in the FOMDP; for every action and fluent, we expect that a successor-state axiom has been defined. The reader should be familiar with the notation and use of the rCase, vCase, and pCase case statements for representing the respective FOMDP reward, value, and transition functions. The reader should also be familiar with the case operators ⊕, ⊗, ∪, and Regr(·) [2], as well as FODTR(·), B^{A(~x)}(·), and B^{A}(·) [9].

Value Function Representation

Following [9], we represent a value function as a weighted sum of k first-order basis functions in case-statement format, denoted bCase_i(s), each containing a small number of formulae that provide a first-order abstraction of state space:

    vCase(s) = ⊕_{i=1}^{k} w_i · bCase_i(s)    (1)

Using this format, we can often achieve a reasonable approximation of the exact value function by exploiting the additive structure inherent in many real-world problems (e.g., additive reward functions or problems with independent subgoals). Unlike exact solution methods, where value functions can grow exponentially in size during the solution process and must be logically simplified [2], here we maintain the value function in a compact form that requires no simplification, just the discovery of good weights.

We can easily apply the FOMDP backup operator B^{A(~x)} [9] to this representation and obtain some simplification as a result of the structure in Eq. 1. Exploiting the properties of the Regr and ⊕ operators, we find that the backup B^{A(~x)} of a linear combination of basis functions is simply the linear combination of the first-order decision-theoretic regression (FODTR) of each basis function [9]:

    B^{A(~x)}(⊕_i w_i · bCase_i(s)) = rCase(s, A(~x)) ⊕ (⊕_i w_i · FODTR(bCase_i(s), A(~x)))    (2)

A corresponding definition of B^{A} follows directly [9].
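To make the representation in Eq. 1 concrete, the following is a minimal sketch (not the authors' implementation) of a value function as a weighted sum of basis case statements. The CasePartition and CaseStatement classes, the linear_value helper, and the logistics-style formula strings are all hypothetical names introduced here for illustration; the cross-sum ⊕ pairs every partition of one case statement with every partition of another, conjoining their formulae and adding their (weighted) values, following the case-operator semantics of [2].

```python
from dataclasses import dataclass

@dataclass
class CasePartition:
    formula: str   # first-order formula describing a region of state space
    value: float   # value assigned to all states satisfying the formula

@dataclass
class CaseStatement:
    partitions: list  # list of CasePartition; partitions are disjoint/exhaustive

def linear_value(weights, basis_cases):
    """Form Eq. 1: vCase(s) = (+)_i w_i * bCase_i(s).

    The cross-sum (+) conjoins the partition formulae pairwise and adds
    the weighted values, so the result stays a single case statement.
    """
    result = CaseStatement([CasePartition("true", 0.0)])
    for w, bc in zip(weights, basis_cases):
        new_parts = []
        for p in result.partitions:
            for q in bc.partitions:
                new_parts.append(CasePartition(
                    f"({p.formula}) ^ ({q.formula})",  # conjoin partitions
                    p.value + w * q.value))            # weighted addition
        result = CaseStatement(new_parts)
    return result

# Toy usage: two basis functions with existentially quantified formulae.
b1 = CaseStatement([CasePartition("exists b. BoxIn(b, paris, s)", 1.0),
                    CasePartition("~exists b. BoxIn(b, paris, s)", 0.0)])
b2 = CaseStatement([CasePartition("exists t. TruckIn(t, paris, s)", 1.0),
                    CasePartition("~exists t. TruckIn(t, paris, s)", 0.0)])
v = linear_value([10.0, 2.0], [b1, b2])
for p in v.partitions:
    print(p.value, ":", p.formula)
```

In a full solver along the lines of [2, 9], the resulting case statements would additionally be logically simplified and inconsistent conjunctions pruned; the sketch omits this step since, as noted above, the linear representation itself requires no simplification, only good weights.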
It is important to note that during the application of these operators we never explicitly ground states or actions, in effect achieving both state and action space abstraction.

First-order Approximate Linear Programming

First-order approximate linear programming (FOALP) was introduced by Sanner and Boutilier [9]. Here we present a linear program (LP) with first-order constraints that generalizes the approximate linear programming solution for MDPs to FOMDPs:

    Variables: w_i ; ∀i ≤ k
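For intuition about what FOALP lifts, the following is a minimal ground-level sketch of approximate linear programming for an ordinary MDP, assuming scipy is available. The toy 3-state, 2-action MDP and the two hand-picked basis functions are illustrative assumptions, not from the paper; at the first-order level, each per-state constraint below would instead be a single universally quantified constraint over the case partitions produced by the backup operator, so the constraint set does not grow with the size of the domain.

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
n_states, n_actions = 3, 2

# Toy transition model P[a][s, s'] and reward R[s, a].
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

# Basis functions evaluated at each state: b[i][s].
b = np.array([[1.0, 1.0, 1.0],    # constant basis
              [0.0, 0.5, 1.0]])   # "progress toward goal" basis

# Objective: minimize sum_s V(s) = sum_i w_i * sum_s b_i(s).
c = b.sum(axis=1)

# Constraints: V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) V(s') for all s, a,
# rewritten as -sum_i w_i * (b_i(s) - gamma * (P^a b_i)(s)) <= -R(s,a).
A_ub, b_ub = [], []
for a in range(n_actions):
    for s in range(n_states):
        A_ub.append([-(b[i, s] - gamma * P[a][s] @ b[i])
                     for i in range(len(b))])
        b_ub.append(-R[s, a])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * len(b))
print("weights:", res.x)
```

The w_i variables of the FOALP LP above play exactly the role of the weight vector solved for here; the first-order constraints replace the enumeration over ground states and actions with constraints stated once per action schema and case partition.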