Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping

This paper investigates conditions under which modifications to the reward function of a Markov decision process preserve the optimal policy. It is shown that, besides the positive linear transformation familiar from utility theory, one can add a reward for transitions between states that is expressible as the difference in value of an arbitrary potential function applied to those states. Furthermore, this is shown to be a necessary condition for invariance, in the sense that any other transformation may yield suboptimal policies unless further assumptions are made about the underlying MDP. These results shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are used to guide the learning agent. In particular, some well-known bugs in reward shaping procedures are shown to arise from non-potential-based rewards, and methods are given for constructing shaping potentials corresponding to distance-based and subgoal-based heuristics. We show that such potentials can lead to substantial reductions in learning time.
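Concretely, the potential-based shaping rewards characterized here can be sketched in the following form (using standard MDP notation not defined in this abstract: γ denotes the discount factor and Φ an arbitrary real-valued potential function over states):

F(s, a, s') = γΦ(s') − Φ(s),

so that the agent is trained on the shaped reward R(s, a, s') + F(s, a, s') in place of R(s, a, s'). As one illustrative distance-based choice, Φ(s) may be taken to decrease with an estimate of the distance from s to the goal, so that transitions toward the goal receive a positive shaping bonus while the optimal policy is left unchanged.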