Metrics and continuity in reinforcement learning

In most practical applications of reinforcement learning, it is untenable to maintain direct estimates for individual states; in continuous-state systems, it is impossible. Instead, researchers often leverage state similarity (whether explicitly or implicitly) to build models that can generalize well from a limited set of samples. The notion of state similarity used, and the neighbourhoods and topologies it induces, is thus of crucial importance, as it directly affects the performance of the resulting algorithms. Indeed, a number of recent works introduce algorithms that assume the existence of “well-behaved” neighbourhoods, but leave the full specification of such topologies for future work. In this paper we introduce a unified formalism for defining these topologies through the lens of metrics. We establish a hierarchy amongst these metrics and demonstrate their theoretical implications for the Markov Decision Process specifying the reinforcement learning problem. We complement our theoretical results with empirical evaluations showcasing the differences between the metrics considered.
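
As one concrete instance of a metric on the state space of an MDP, the sketch below computes the bisimulation metric of Ferns, Panangaden, and Precup (2004) on a small finite MDP by fixed-point iteration, using SciPy's linear-programming solver for the Kantorovich (1-Wasserstein) coupling. This is an illustrative sketch only, not the unified formalism introduced in the paper: the array layout (R[a, s], P[a, s, s']), the coefficients c_r and c_t, and the helper names are assumptions made for this example, and convergence of the iteration relies on c_t < 1.

```python
import numpy as np
from scipy.optimize import linprog


def wasserstein(d, p, q):
    # 1-Wasserstein distance between distributions p and q on a finite
    # state space, under ground metric d, via the Kantorovich LP.
    n = len(p)
    cost = d.reshape(-1)                  # coupling matrix, flattened row-major
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0  # row marginal:    sum_j pi[i, j] = p[i]
        A_eq[n + i, i::n] = 1.0           # column marginal: sum_i pi[i, j] = q[i]
    b_eq = np.concatenate([p, q])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun


def bisimulation_metric(R, P, c_r=1.0, c_t=0.9, tol=1e-8):
    # Fixed-point iteration for a bisimulation metric on a finite MDP.
    # R[a, s]    : expected reward for taking action a in state s.
    # P[a, s, :] : next-state distribution for taking action a in state s.
    num_actions, num_states = R.shape
    d = np.zeros((num_states, num_states))
    while True:
        d_next = np.zeros_like(d)
        for s in range(num_states):
            for t in range(s + 1, num_states):
                gap = max(
                    c_r * abs(R[a, s] - R[a, t])
                    + c_t * wasserstein(d, P[a, s], P[a, t])
                    for a in range(num_actions)
                )
                d_next[s, t] = d_next[t, s] = gap
        if np.abs(d_next - d).max() < tol:
            return d_next
        d = d_next


# Hypothetical two-state, one-action MDP: state 0 is rewarding, state 1 is not.
R = np.array([[1.0, 0.0]])                # shape (actions, states)
P = np.array([[[0.9, 0.1], [0.1, 0.9]]])  # shape (actions, states, states)
print(bisimulation_metric(R, P))
```

States that are close under this metric have similar immediate rewards and similar (in the Wasserstein sense) transition behaviour, which is the kind of "well-behaved" neighbourhood structure the abstract alludes to; other metrics in the hierarchy trade off how much of this structure they preserve.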
