Comparing different methods to speed up reinforcement learning in a complex domain

We introduce a new learning algorithm, the semi-DP algorithm, designed for Markov decision processes (MDPs) in which every action either leads to a deterministic successor state or to the terminal state. The algorithm requires only a finite number of loops to converge exactly to the optimal action-value function. We compare this algorithm, together with three other methods for speeding up or simplifying the learning process, against ordinary Q-learning in a soccer grid-world. Furthermore, we show that different reward functions can considerably change the convergence time of the learning algorithms, even when the optimal policy remains unchanged.
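To illustrate the class of MDPs the abstract describes, the following is a minimal sketch (not the authors' semi-DP algorithm) of exact dynamic-programming backups for an MDP in which every action either moves deterministically to a successor state or ends the episode. The toy transition table, state names, rewards, and discount factor are hypothetical and serve only to make the sketch runnable.

```python
GAMMA = 0.9
TERMINAL = None  # marker for "this action ends the episode"

# transitions[state][action] = (successor_or_TERMINAL, reward)
# Hypothetical toy MDP: each action has a deterministic outcome.
transitions = {
    "s0": {"right": ("s1", 0.0), "quit": (TERMINAL, -1.0)},
    "s1": {"right": ("s2", 0.0), "quit": (TERMINAL, -1.0)},
    "s2": {"score": (TERMINAL, +1.0)},
}

def exact_backups(transitions, gamma=GAMMA):
    """Sweep over all (state, action) pairs until the action-values stop
    changing; with deterministic successors each backup is exact."""
    q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    sweeps = 0
    while True:
        sweeps += 1
        changed = False
        for s, acts in transitions.items():
            for a, (succ, r) in acts.items():
                # Terminal actions have no bootstrapped term.
                target = r if succ is TERMINAL else r + gamma * max(q[succ].values())
                if abs(target - q[s][a]) > 1e-12:
                    q[s][a] = target
                    changed = True
        if not changed:
            return q, sweeps

q_values, n_sweeps = exact_backups(transitions)
print(f"converged after {n_sweeps} sweeps")
for s, acts in q_values.items():
    print(s, {a: round(v, 3) for a, v in acts.items()})
```

Because every backup target is exact (no expectation over stochastic successors), the action-values for this toy example settle after a small, finite number of sweeps, which is the structural property the abstract attributes to the deterministic-or-terminal MDP class.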