Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment (Sutton & Tanner, 2005). These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that they are about what would happen if an action or sequence of actions were taken. In conventional TD networks, the inter-related predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving. Our generalization is based on the options framework for temporal abstraction (Sutton, Precup & Singh, 1999). The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces. We present empirical examples of our algorithm's effectiveness and of the greater representational expressiveness of temporally-abstract TD networks.
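The abstract's algorithm builds on linear TD(λ), the standard combination of function approximation and eligibility traces. As background, here is a minimal sketch of linear TD(λ) with accumulating traces for ordinary value prediction; it is not the paper's intra-option TD-network algorithm, and all names (`td_lambda`, `features`, `rewards`) are illustrative.

```python
import numpy as np

def td_lambda(features, rewards, w=None, alpha=0.1, gamma=0.9, lam=0.8):
    """Linear TD(lambda) with accumulating eligibility traces.

    features: feature vectors x_0 .. x_T (x_T for the terminal state)
    rewards:  rewards r_1 .. r_T observed on each transition
    w:        optional initial weights; V(s) is approximated by w . x(s)
    Returns the updated weight vector after one pass (one episode).
    """
    w = np.zeros(len(features[0])) if w is None else w.copy()
    e = np.zeros_like(w)  # eligibility trace, reset at episode start
    for t in range(len(rewards)):
        x, x_next = features[t], features[t + 1]
        delta = rewards[t] + gamma * (w @ x_next) - w @ x  # TD error
        e = gamma * lam * e + x       # decay trace, then accumulate features
        w = w + alpha * delta * e     # trace-weighted TD update
    return w
```

With one-hot (tabular) features this reduces to ordinary TD(λ); the paper's contribution replaces the single-step, single-action targets with option-conditional ones.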
[1] Allen Newell et al., "SOAR: An Architecture for General Intelligence," Artificial Intelligence, 1987.
[2] Richard S. Sutton et al., "Reinforcement Learning: An Introduction," MIT Press, 1998.
[3] Richard S. Sutton et al., "Reinforcement learning with replacing eligibility traces," Machine Learning, 1996.
[4] Sanjoy Dasgupta et al., "Off-Policy Temporal Difference Learning with Function Approximation," ICML, 2001.
[5] Richard S. Sutton et al., "TD Models: Modeling the World at a Mixture of Time Scales," ICML, 1995.
[6] Ronald L. Rivest et al., "Diversity-Based Inference of Finite Automata (Extended Abstract)," FOCS, 1987.
[7] Michael L. Littman et al., "Predictive Representations of State," NIPS, 2001.
[8] Michael R. James et al., "Predictive State Representations: A New Theory for Modeling Dynamical Systems," UAI, 2004.
[9] Richard S. Sutton et al., "Learning to predict by the methods of temporal differences," Machine Learning, 1988.