Recent Advances in Hierarchical Reinforcement Learning

Reinforcement learning is bedeviled by the curse of dimensionality: the number of parameters to be learned grows exponentially with the size of any compact encoding of a state. Recent attempts to combat this curse have turned to principled ways of exploiting temporal abstraction, in which decisions are not required at each time step but instead invoke temporally extended activities that follow their own policies until termination. This leads naturally to hierarchical control architectures and associated learning algorithms. We review several approaches to temporal abstraction and hierarchical organization that machine learning researchers have recently developed. Common to these approaches is a reliance on the theory of semi-Markov decision processes, which we emphasize in our review. We then discuss extensions of these ideas to concurrent activities, multiagent coordination, and hierarchical memory for addressing partial observability. Concluding remarks address open challenges facing the further development of reinforcement learning in a hierarchical setting.
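To make the semi-Markov decision process view concrete, consider the standard SMDP Q-learning update (a minimal illustration in the spirit of the methods surveyed, not a formula quoted from the article). Suppose an activity $o$, initiated in state $s$, terminates after a random duration $\tau$ in state $s'$, having accumulated the discounted reward $r = r_1 + \gamma r_2 + \cdots + \gamma^{\tau-1} r_\tau$. The value of the state-activity pair is then updated as

$$ Q(s, o) \leftarrow Q(s, o) + \alpha \big[ r + \gamma^{\tau} \max_{o'} Q(s', o') - Q(s, o) \big], $$

with step size $\alpha$. Because the update is applied only when an activity terminates, decisions and learning occur at activity boundaries rather than at every primitive time step, which is the source of the savings described above.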
