Learning Control of Finite Markov Chains

Abstract

Two learning algorithms are presented for the Markovian decision problem in which the transition probabilities are unknown. At each time step, the algorithms select a decision on the basis of the current estimates of the unknown probabilities. One makes a probabilistic selection and the other a deterministic one, but both are devised in view of the dual control problem. It is shown that the probabilistic algorithm converges to an optimal policy. The deterministic algorithm, on the other hand, is shown to achieve only an ε-optimal frequency of selecting an optimal policy. However, the selection procedure of the latter is much simpler than that of the former.
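As an illustration of what "estimates of the unknown probabilities" could look like in practice, the following minimal sketch keeps relative-frequency estimates of the transition probabilities from observed transitions. The abstract does not specify the estimator, so the relative-frequency rule, the uniform fallback for unvisited pairs, and the names `TransitionEstimator`, `observe`, and `estimate` are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict


class TransitionEstimator:
    """Relative-frequency estimates of P(next_state | state, decision).

    Hypothetical helper: the estimator form is an assumption, not taken
    from the paper.
    """

    def __init__(self, n_states):
        self.n_states = n_states
        # counts[(state, decision)][next_state] = number of observed transitions
        self.counts = defaultdict(lambda: [0] * n_states)

    def observe(self, state, decision, next_state):
        """Record one observed transition."""
        self.counts[(state, decision)][next_state] += 1

    def estimate(self, state, decision):
        """Return the estimated transition distribution for (state, decision)."""
        row = self.counts[(state, decision)]
        total = sum(row)
        if total == 0:
            # Assumed fallback: uniform distribution when nothing was observed.
            return [1.0 / self.n_states] * self.n_states
        return [c / total for c in row]


est = TransitionEstimator(n_states=2)
for nxt in (1, 1, 0, 1):
    est.observe(state=0, decision=0, next_state=nxt)
print(est.estimate(0, 0))  # estimated distribution over next states
```

A decision rule, whether probabilistic or deterministic, would then be computed from these estimates at each step. How that rule trades off estimation against control is the substance of the two algorithms and is not reproduced here.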