Analysis of Some Incremental Variants of Policy Iteration: First Steps Toward Understanding Actor-Critic Learning Systems

This paper studies algorithms based on an incremental dynamic programming abstraction of one of the key issues in understanding the behavior of actor-critic learning systems. The prime example of such a learning system is the ASE/ACE architecture introduced by Barto, Sutton, and Anderson (1983). Also related are Witten's adaptive controller (1977) and Holland's bucket brigade algorithm (1986). The key feature of such a system is the presence of separate adaptive components for action selection and state evaluation, and the key issue focused on here is the extent to which their joint adaptation is guaranteed to lead to optimal behavior in the limit. In the incremental dynamic programming point of view taken here, these questions are formulated in terms of the use of separate data structures for the current best choice of policy and current best estimate of state values, with separate operations used to update each at individual states. Particular emphasis here is on the effect of complete asynchrony in the updating of these data structures across states. The main results are that, while convergence to optimal performance is not guaranteed in general, there are a number of situations in which such convergence is assured. Since the algorithms investigated represent a certain idealized abstraction of actor-critic learning systems, these results are not directly applicable to current versions of such learning systems but may be viewed instead as providing a useful first step toward more complete understanding of such systems. Another useful perspective on the algorithms analyzed here is that they represent a broad class of asynchronous dynamic programming procedures based on policy iteration.
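
To make the abstraction concrete, the sketch below (in Python) shows one way such an asynchronous policy-iteration scheme can be organized: a value table plays the critic's role, a policy table plays the actor's role, and each is updated one state at a time in an arbitrary order. This is an illustrative reconstruction rather than the specific algorithms analyzed in the paper; the transition model P, reward table R, discount GAMMA, and the helpers value_backup and policy_improve are all assumed names for a toy problem.

# Minimal sketch of asynchronous, per-state policy iteration with separate
# actor (policy table) and critic (value table) data structures.
# All problem data below is an illustrative toy MDP, not from the paper.
import random

GAMMA = 0.9                      # discount factor (assumed)
N_STATES, N_ACTIONS = 5, 2       # toy problem size (assumed)

# P[s][a] is a list of (probability, next_state); R[s][a] is the expected reward.
P = [[[(1.0, (s + a) % N_STATES)] for a in range(N_ACTIONS)] for s in range(N_STATES)]
R = [[float(a == s % N_ACTIONS) for a in range(N_ACTIONS)] for s in range(N_STATES)]

V = [0.0] * N_STATES             # critic: current estimate of state values
policy = [0] * N_STATES          # actor: current best choice of action per state

def q_value(s, a):
    # One-step lookahead value of taking action a in state s under the current V.
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])

def value_backup(s):
    # Critic update at a single state: back up the value of the current policy's action.
    V[s] = q_value(s, policy[s])

def policy_improve(s):
    # Actor update at a single state: switch to an action that is greedy with respect to V.
    policy[s] = max(range(N_ACTIONS), key=lambda a: q_value(s, a))

# Completely asynchronous schedule: at each step pick a state at random and
# apply either the value update or the policy update there.
for _ in range(2000):
    s = random.randrange(N_STATES)
    if random.random() < 0.5:
        value_backup(s)
    else:
        policy_improve(s)

print("policy:", policy)
print("values:", [round(v, 3) for v in V])

On this toy problem the two tables typically settle to a mutually consistent greedy policy and value estimates; as the paper's results indicate, however, such convergence need not hold for every asynchronous ordering of these per-state updates.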

[1] Arthur L. Samuel, et al. Some Studies in Machine Learning Using the Game of Checkers, 1967, IBM J. Res. Dev.

[2] Ian H. Witten, et al. An Adaptive Optimal Controller for Discrete-Time Markov Environments, 1977, Inf. Control.

[3] M. Puterman, et al. Modified Policy Iteration Algorithms for Discounted Markov Decision Problems, 1978.

[4] Richard S. Sutton, et al. Temporal Credit Assignment in Reinforcement Learning, 1984.

[5] Dimitri P. Bertsekas, et al. Dynamic Programming: Deterministic and Stochastic Models, 1987.

[6] Paul J. Werbos, et al. Building and Understanding Adaptive Systems: A Statistical/Numerical Approach to Factory Automation and Brain Research, 1987, IEEE Transactions on Systems, Man, and Cybernetics.

[7] Kumpati S. Narendra, et al. Learning Automata: An Introduction, 1989.

[8] John N. Tsitsiklis, et al. Parallel and Distributed Computation, 1989.

[9] L. Baird, et al. A Mathematical Analysis of Actor-Critic Architectures for Learning Optimal Controls Through Incremental Dynamic Programming, 1990.

[10] Richard S. Sutton, et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, 1990, ML.

[11] Richard S. Sutton, et al. Planning by Incremental Dynamic Programming, 1991, ML.

[12] Jing Peng, et al. Efficient Learning and Planning Within the Dyna Framework, 1993, Adapt. Behav.

[13] Andrew W. Moore, et al. Memory-Based Reinforcement Learning: Converging with Less Data and Less Real Time, 1993.

[14] Michael I. Jordan, et al. Massachusetts Institute of Technology Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences, 1996.

[15] Ben J. A. Kröse, et al. Learning from Delayed Rewards, 1995, Robotics Auton. Syst.