Compositional Models for Reinforcement Learning

Innovations such as optimistic exploration, function approximation, and hierarchical decomposition have helped scale reinforcement learning to more complex environments, but these three ideas have rarely been studied together. This paper develops a unified framework that formalizes each of these algorithmic contributions as an operator on learned models of the environment. The formalism reveals synergies among the innovations and suggests a straightforward way to compose them. The resulting algorithm, Fitted R-MAXQ, is the first to combine the function approximation of fitted algorithms, the efficient model-based exploration of R-MAX, and the hierarchical decomposition of MAXQ.
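
To make the operator view concrete, the following is a minimal sketch, assuming a small tabular domain; the names (LearnedModel, optimistic, plan) are illustrative, not the paper's implementation. An empirical model is learned from experience, an R-MAX-style optimism operator wraps that model, and a planner is applied to the wrapped model unchanged:

```python
# Minimal sketch (hypothetical names, not the paper's code):
# R-MAX-style optimism expressed as an operator wrapping a learned model.
from collections import defaultdict

GAMMA = 0.95   # discount factor
M = 5          # visits before a state-action pair counts as "known"
R_MAX = 1.0    # upper bound on one-step reward

class LearnedModel:
    """Empirical transition and reward estimates from (s, a, r, s') data."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}
        self.reward_sum = defaultdict(float)                 # (s, a) -> total r
        self.visits = defaultdict(int)                       # (s, a) -> n

    def update(self, s, a, r, s2):
        self.counts[(s, a)][s2] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

def optimistic(model, s, a):
    """Optimism operator: unknown pairs become R_MAX-paying self-loops."""
    n = model.visits[(s, a)]
    if n < M:
        return R_MAX, {s: 1.0}
    r = model.reward_sum[(s, a)] / n
    p = {s2: c / n for s2, c in model.counts[(s, a)].items()}
    return r, p

def plan(model, states, actions, sweeps=100):
    """Value iteration against the optimistic model; a fitted method
    would replace this tabular V with a function approximator."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            # Unlisted successor states are treated as having value 0.
            V[s] = max(r + GAMMA * sum(p * V.get(s2, 0.0)
                                       for s2, p in pr.items())
                       for r, pr in (optimistic(model, s, a) for a in actions))
    return V
```

The point of phrasing optimism as an operator is that plan() consumes anything exposing the model interface, so in principle the same planner applies to a primitive learned model, to its optimistic wrapping, or to the per-subtask models of a MAXQ-style hierarchy.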
