An Analysis of Direct Reinforcement Learning in Non-Markovian Domains

It is well known that for Markov decision processes, the policies that are stable under policy iteration and under the standard reinforcement learning methods are exactly the optimal policies. In this paper, we investigate the conditions for policy stability in the more general situation where the Markov property cannot be assumed. We show that for a general class of non-Markov decision processes, if actual-return (Monte Carlo) credit assignment is used with undiscounted returns, the optimal observation-based policies are still guaranteed to be equilibrium points in the policy space under the standard “direct” reinforcement learning approaches. However, if either discounted rewards or a temporal-differences style of credit assignment is used, this is not the case.
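
To make the contrast between the two credit assignment schemes concrete, the sketch below compares an undiscounted actual-return (Monte Carlo) update with a discounted one-step temporal-difference update for observation-based Q-values. This is an illustrative Python sketch under assumptions introduced here, not the paper's implementation: the function names, the table Q, and the parameters alpha and gamma are hypothetical.

    from collections import defaultdict

    def mc_update(Q, episode, alpha=0.1):
        # Actual-return (Monte Carlo) credit assignment with undiscounted
        # returns: each (observation, action) pair is moved toward the total
        # reward received from that step to the end of the episode.
        G = 0.0
        for obs, act, reward in reversed(episode):
            G += reward  # undiscounted return-to-go
            Q[(obs, act)] += alpha * (G - Q[(obs, act)])

    def td_update(Q, obs, act, reward, next_obs, actions, alpha=0.1, gamma=0.9):
        # One-step temporal-difference (Q-learning style) update: the target
        # is discounted and bootstrapped from the learner's own estimate for
        # the next observation rather than the actual observed return.
        target = reward + gamma * max(Q[(next_obs, a)] for a in actions)
        Q[(obs, act)] += alpha * (target - Q[(obs, act)])

    # Usage with hypothetical observations and actions:
    Q = defaultdict(float)
    episode = [("o1", "left", 0.0), ("o2", "right", 1.0)]
    mc_update(Q, episode)
    td_update(Q, "o1", "left", 0.0, "o2", ["left", "right"])

Because the observations need not satisfy the Markov property, the bootstrapped, discounted target in the second update can pull the values away from those of an optimal observation-based policy, whereas the undiscounted actual-return update leaves such policies as equilibrium points.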
