On the Computational Economics of Reinforcement Learning

Abstract

Following terminology used in adaptive control, we distinguish between indirect learning methods, which learn explicit models of the dynamic structure of the system to be controlled, and direct learning methods, which do not. We compare an existing indirect method, which uses a conventional dynamic programming algorithm, with a closely related direct reinforcement learning method by applying both to an infinite-horizon Markov decision problem with unknown state-transition probabilities. The simulations show that although the direct method requires much less space and dramatically less computation per control action, its learning ability on this task is superior to, or at least comparable with, that of the more complex indirect method. Although these results do not address how the methods' performance compares as problems become more difficult, they suggest that, given a fixed amount of computational power per control action, it may be better to use a direct reinforcement learning method augmented with indirect techniques than to devote all available resources to a computationally costly indirect method. Comprehensive answers to the questions raised by this study depend on many factors making up the economic context of the computation.
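The contrast the abstract draws can be made concrete. The following is a minimal, hypothetical Python sketch, not the paper's implementation: it assumes the direct method behaves like a tabular Q-learning-style update, and the indirect method like maximum-likelihood model estimation followed by value-iteration sweeps. The state/action sizes, step-size, function names, and smoothing choices are all illustrative assumptions.

```python
# Hypothetical sketch (not the paper's code) contrasting the per-action
# cost of a direct update with an indirect, model-based update on a
# finite MDP with unknown transition probabilities.
import numpy as np

n_states, n_actions = 10, 2   # illustrative sizes
gamma, alpha = 0.9, 0.1      # discount factor, step size

# Direct method: one tabular temporal-difference backup per action.
Q = np.zeros((n_states, n_actions))

def direct_update(s, a, r, s_next):
    # O(|A|) work per control action: update one table entry.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# Indirect method: maintain transition counts and reward estimates,
# then run value-iteration sweeps after each action.
counts = np.ones((n_states, n_actions, n_states))  # Laplace-smoothed counts
R_hat = np.zeros((n_states, n_actions))            # running reward estimate
V = np.zeros(n_states)                             # value-function estimate

def indirect_update(s, a, r, s_next, sweeps=20):
    counts[s, a, s_next] += 1
    n = counts[s, a].sum()
    R_hat[s, a] += (r - R_hat[s, a]) / n           # smoothed running mean
    P_hat = counts / counts.sum(axis=2, keepdims=True)
    # O(sweeps * |S|^2 * |A|) work per control action: full DP sweeps.
    for _ in range(sweeps):
        V[:] = (R_hat + gamma * (P_hat @ V)).max(axis=1)

# Example: one step of experience updates both learners.
s, a, r, s_next = 0, 1, 1.0, 3
direct_update(s, a, r, s_next)
indirect_update(s, a, r, s_next)
```

The point of the sketch is the cost asymmetry the abstract describes: the direct update touches a single table entry per control action, while the indirect update re-estimates the model and sweeps the entire state space, at a much greater cost in space and computation.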
