Temporally Extended Metrics for Markov Decision Processes

Developing safe and efficient methods for state abstraction in reinforcement learning systems is an open research problem. We propose to address it by leveraging ideas from formal verification, namely bisimulation. Specifically, we generalize the notion of bisimulation by allowing arbitrary comparisons between states rather than strict reward matching. We further develop a notion of temporally extended metrics, which extend a base metric on the states of an environment so that it reflects not only the immediate difference between states but also the extent to which that distance is preserved along transitions. We show that this property is not satisfied by bisimulation metrics, which have previously been used to compare states with respect to their long-term rewards. A temporal extension can be defined for any base metric of interest, making the construction highly flexible. The kernel of a temporally extended metric corresponds precisely to exact bisimulation (thus these metrics form a larger class of bisimulation metrics). We provide bounds relating bisimulation and temporally extended metrics, and we examine the couplings of state distributions that they induce.
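
For context, a standard bisimulation metric from the literature (in the sense of Ferns, Panangaden, and Precup) is the fixed point of an operator of the following form; the constants $c_R, c_T$ and the Wasserstein term are the usual choices in that line of work, and the display is an illustrative sketch rather than the construction introduced in this paper:
\[
  F(d)(s, s') \;=\; \max_{a \in A} \Big( c_R \, \big| R(s,a) - R(s',a) \big| \;+\; c_T \, W_d\big( P(\cdot \mid s, a),\, P(\cdot \mid s', a) \big) \Big),
\]
where $W_d$ denotes the 1-Wasserstein (Kantorovich) distance between transition distributions computed with respect to $d$. The generalization described above replaces the strict reward comparison $|R(s,a) - R(s',a)|$ with an arbitrary base comparison between states, and the temporal extension is then built on top of such a base metric.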