On Reinforcement Learning of Control Actions in Noisy and Non-Markovian Domains

If reinforcement learning (RL) techniques are to be used for real-world dynamic system control, the problems of noise and plant disturbance will have to be addressed. This study investigates the effects of noise/disturbance on five different RL algorithms: Watkins' Q-Learning (QL); Barto, Sutton and Anderson's Adaptive Heuristic Critic (AHC); Sammut and Law's modern variant of Michie and Chambers' BOXES algorithm; and two new algorithms developed during the course of this study. Both new algorithms, called P-Trace and Q-Trace respectively, are conceptually related to QL. Both provide for substantially faster learning than straight QL overall, and for dramatically faster learning (by up to a factor of [...]) in the special case of learning in a noisy environment for the dynamic system studied here, a pole-and-cart simulation. As well as speeding learning, both the P-Trace and Q-Trace algorithms have been designed to preserve the "convergence with probability 1" formal properties of standard QL, i.e. that they be provably correct algorithms for Markovian domains under the same conditions for which QL is guaranteed to be correct. We present both arguments and experimental evidence that "trace" methods may prove to be both faster and more powerful in general than TD (Temporal Difference) methods. The potential performance improvements of trace over pure TD methods may turn out to be particularly important when learning is to occur in noisy or stochastic environments, and in the case where the domain is not well modelled by Markovian processes. A surprising result to emerge from this study is evidence for hitherto unsuspected chaotic behaviour with respect to learning rates exhibited by the well-studied AHC algorithm. The effect becomes more pronounced as noise increases.
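For orientation, the baseline that the two new algorithms are compared against can be sketched as the standard one-step tabular Q-learning backup. This is a minimal illustration of textbook QL only, not of P-Trace or Q-Trace, whose update rules are not given here. The function name `q_update`, the dictionary-based table, and the default values of `alpha` and `gamma` are assumptions made for the sketch.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.95):
    """Apply one standard one-step Q-learning backup to table q, in place.

    q is a mapping from (state, action) pairs to value estimates.
    alpha (step size) and gamma (discount) are illustrative defaults,
    not values taken from the study.
    """
    # Greedy estimate of the value of the successor state.
    best_next = max(q[(next_state, a)] for a in actions)
    # One-step temporal-difference error for the action actually taken.
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


# Usage: a table that defaults to 0.0 and a two-action problem,
# e.g. push left / push right in a discretised pole-and-cart task.
q = defaultdict(float)
q_update(q, state=0, action=1, reward=1.0, next_state=1, actions=(0, 1))
```

Because each backup looks only one step ahead, a single noisy observation or reward perturbs the estimate it touches directly, which is the setting in which the abstract reports the largest gains for trace methods.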
