Conditional Random Fields for Reinforcement Learning

Distributed reinforcement learning (RL) involves a collection of nodes, or agents, choosing actions to maximise a long-term reward measure. Examples of such domains are traffic routing for roads or networks, sensor networks, pursuer-evader problems, and job-shop scheduling. The simplest algorithms assume all agents are independent, learning to cooperate only through a shared reward function. More advanced algorithms explicitly share information about state, or factor the global reward into local rewards [1]. But in all these cases each node chooses its action independently. A naive fix is to make decisions sequentially, allowing nodes to condition their actions on decisions that earlier nodes made. However, we would much prefer that the nodes choose the optimal joint set of actions, taking into account the actions of all other relevant nodes. We use conditional random fields (CRFs) to efficiently model the conditional dependencies between agents. The same inference methods used for CRFs can be used to sample node actions from a joint stochastic policy. We also show how to optimise this joint policy by estimating the gradients of the long-term average reward with respect to the policy parameters. Moreover, similar methods could be used for RL policies based on arbitrary graphical models.

CRFs are traditionally used to model $P(y\,|\,x;\theta)$, the probability of a set of labels $y$ conditioned on observable variables $x$ and the CRF parameters $\theta$ [4]. CRF training iterates through sets of training instances $\{x, y\}$, finding $\theta^* = \arg\max_\theta p(\theta\,|\,X,Y)$. To predict labels for a novel observation $x'$ we select labels $y' = \arg\max_y P(y\,|\,x';\theta^*)$. To extend CRFs to online temporal processes, Dynamic Bayesian Networks (DBNs) have been used, unfolding the CRF model over time. Another interpretation of our work is that we show how CRF parameters can be adapted online for time-series prediction, and control, without needing DBN models.

Our RL framework is that of distributed partially observable Markov decision processes. Each RL agent is represented by a node in the CRF. The input vector $x$ represents the total set of observations/features presented to all the agents. Actions are equivalent to hidden labels $y$, each element of $y$ representing a single node's action. The optimisation task is to find the CRF parameters $\theta$ such that sampling joint actions $y(t)$ from $P(\cdot\,|\,x(t);\theta)$ maximises the long-term average reward $R(\theta) = \lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^{T} r(t)$ (a discounted model may also be used).

The CRF/policy distribution is represented as an exponential family $P(y\,|\,x;\theta) = \exp(\langle \phi(x,y),\theta\rangle - z(\theta\,|\,x))$. Here $\phi$ is the sufficient statistic, a vector of features for nodes and edges, and $z$ is the log partition function $z(\theta\,|\,x) := \ln \sum_{y\in\mathcal{Y}} \exp(\langle \phi(x,y),\theta\rangle)$. Node features represent the observation of state available at each node. The edge features encode the communication between nodes about their actions and features. It is worth noting that this exponential family representation, with a dot product between features and parameters, implements exactly the soft-max stochastic policy with linear feature combination commonly encountered in RL applications. Only the edge features prevent trivial factorisation of the distribution into independent agents. The policy distribution is therefore expensive to evaluate in general; however, using CRFs allows the clique decomposition theorem to come into play, decomposing the distribution into terms over the maximal cliques $c \in \mathcal{C}$ of the CRF graph, so that $P(y\,|\,x;\theta) = \exp\left(\sum_{c\in\mathcal{C}} \langle \phi_c(x,y_c),\theta_c\rangle - z(\theta\,|\,x)\right)$.
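To make the clique decomposition concrete, the following is a minimal sketch of drawing a joint action from such a policy when the cliques are pairs of adjacent nodes in a chain, using forward filtering followed by backward sampling so the $A^N$ joint action space is never enumerated. The function and variable names (`sample_joint_action`, `node_logpot`, `edge_logpot`) are illustrative choices of ours, and the log-potentials stand in for the clique scores $\langle\phi_c(x,y_c),\theta_c\rangle$; this is not code from the paper.

```python
# Sketch: sample a joint action y ~ P(y | x; theta) for a chain-structured CRF policy.
# node_logpot[i][a]    stands in for <phi_i(x, a), theta>          (node clique score)
# edge_logpot[i][a, b] stands in for <phi_{i,i+1}(x, a, b), theta>  (edge clique score)
import numpy as np

def logsumexp(a, axis=-1):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sample_joint_action(node_logpot, edge_logpot, rng):
    """Forward-filtering / backward-sampling on a chain of N agents with A actions each."""
    N = len(node_logpot)
    # Forward pass: alpha[i][a] = log-sum over y_0..y_{i-1} of the unnormalised score with y_i = a.
    alpha = [np.asarray(node_logpot[0], dtype=float)]
    for i in range(1, N):
        prev = alpha[-1][:, None] + edge_logpot[i - 1]            # shape (A, A)
        alpha.append(node_logpot[i] + logsumexp(prev, axis=0))    # shape (A,)
    # Backward pass: sample y_{N-1}, then condition each earlier agent on its sampled neighbour.
    y = np.empty(N, dtype=int)
    p = np.exp(alpha[-1] - logsumexp(alpha[-1]))
    y[-1] = rng.choice(len(p), p=p)
    for i in range(N - 2, -1, -1):
        logits = alpha[i] + edge_logpot[i][:, y[i + 1]]
        p = np.exp(logits - logsumexp(logits))
        y[i] = rng.choice(len(p), p=p)
    return y

# Tiny usage example: 5 agents in a chain, 3 actions each, random stand-in clique scores.
rng = np.random.default_rng(0)
A, N = 3, 5
node = [rng.normal(size=A) for _ in range(N)]
edge = [rng.normal(size=(A, A)) for _ in range(N - 1)]
print(sample_joint_action(node, edge, rng))
```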
For example, in a 1D CRF (a chain) the cliques are the pairs of adjacent nodes $i$ and $j$. An often useful clique sufficient statistic in this case is $\phi_{ij}(x, y_i, y_j) = [x, 1]^\top$ for connected nodes $i, j$ if $y_i = y_j$, and $[x, 0]^\top$ otherwise.
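Under this feature the shared $\langle x,\theta\rangle$ term is identical for every action pair, so within an edge only the final "agreement" weight distinguishes $y_i = y_j$ from $y_i \neq y_j$; its sign controls whether neighbouring agents are encouraged to choose the same action. A hypothetical helper, assuming this split of $\theta$ into observation weights plus one agreement weight (the name and layout are our assumptions, not the paper's), could turn the statistic into the edge scores used by the sampler above:

```python
import numpy as np

def edge_logpot_from_agreement_feature(x, theta, num_actions):
    """Edge score <phi_ij(x, y_i, y_j), theta> for phi_ij = [x, 1] if y_i == y_j else [x, 0].
    theta[:-1] weights the shared observation x; theta[-1] weights the agreement indicator."""
    base = float(np.dot(x, theta[:-1]))                    # identical for all (y_i, y_j)
    scores = np.full((num_actions, num_actions), base)
    scores[np.diag_indices(num_actions)] += theta[-1]      # extra score when y_i == y_j
    return scores

# e.g. edge_logpot = [edge_logpot_from_agreement_feature(x, theta, A) for _ in range(N - 1)]
```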