Temporal abstraction in reinforcement learning

Decision making usually involves choosing among courses of action over a broad range of time scales. For instance, a person planning a trip to a distant location makes high-level decisions about which means of transportation to use, but also chooses low-level actions, such as the movements involved in getting into a car. The problem of picking an appropriate time scale for reasoning and learning has been explored in artificial intelligence, control theory, and robotics. This dissertation develops a general framework for prediction, control, and learning at multiple temporal scales, in the context of Markov Decision Processes (MDPs) and reinforcement learning. In this framework, a temporally extended action is represented by a way of behaving (a policy) together with a termination condition; an action represented in this way is called an option. Options can be incorporated easily into MDPs, allowing an agent to use existing controllers, heuristics for picking actions, or learned courses of action. The effects of behaving according to an option can be predicted using multi-time models, which are learned by interacting with the environment. We develop multi-time models and show how they can be used to produce plans of behavior very quickly, using classical dynamic programming or reinforcement learning techniques. The most interesting feature of the framework is that it allows an agent to work simultaneously with high-level and low-level temporal representations; the interplay of these levels can be exploited to learn and plan more efficiently and more accurately. We develop new algorithms that take advantage of this structure to improve the quality of plans and to learn in parallel about the effects of many different options.
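
To make the option construct concrete, the following is a minimal sketch, not the dissertation's own code: it assumes a small discrete MDP, a dict-of-dicts Q-table `Q[state][option]`, and a hypothetical `env.step(state, action) -> (next_state, reward, done)` interface. It shows an option as a policy paired with a termination condition, and an SMDP-style Q-learning backup that discounts by the number of steps the option ran.

```python
import random


class Option:
    """A temporally extended action: a policy plus a termination condition.

    Hypothetical minimal interface; the full framework also attaches an
    initiation set to each option, omitted here for brevity.
    """

    def __init__(self, policy, termination):
        self.policy = policy            # state -> primitive action
        self.termination = termination  # state -> probability of stopping

    def act(self, state):
        return self.policy(state)

    def terminates(self, state):
        return random.random() < self.termination(state)


def smdp_q_update(Q, env, state, option, alpha=0.1, gamma=0.9):
    """One SMDP-style backup after running an option to termination.

    Accumulates the discounted reward over the k steps the option runs,
    then updates
        Q(s, o) += alpha * (R + gamma**k * max_o' Q(s', o') - Q(s, o)).
    `env.step` is a stand-in for any environment interface returning
    (next_state, reward, done).
    """
    total_reward, discount, k = 0.0, 1.0, 0
    s, done = state, False
    while True:
        a = option.act(s)
        s_next, r, done = env.step(s, a)
        total_reward += discount * r   # reward at step i weighted by gamma**i
        discount *= gamma
        k += 1
        s = s_next
        if done or option.terminates(s):
            break
    best_next = 0.0 if done else max(Q[s].values())
    Q[state][option] += alpha * (total_reward
                                 + (gamma ** k) * best_next
                                 - Q[state][option])
    return s, done
```

Because the backup treats the whole option execution as a single decision stage, the same update rule applies whether an "action" is a one-step primitive or a long learned course of behavior, which is what lets high-level and low-level representations coexist in one value function.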
