A Geometric Perspective on Optimal Representations for Reinforcement Learning

We propose a new perspective on representation learning in reinforcement learning based on geometric properties of the space of value functions. We leverage this perspective to provide formal evidence regarding the usefulness of value functions as auxiliary tasks. Our formulation considers adapting the representation to minimize the (linear) approximation error of the value functions of all stationary policies for a given environment. We show that this optimization reduces to making accurate predictions for a special class of value functions, which we call adversarial value functions (AVFs). We demonstrate that using value functions as auxiliary tasks corresponds to an expected-error relaxation of our formulation, with AVFs as a natural candidate, and identify a close relationship with proto-value functions (Mahadevan, 2005). We highlight characteristics of AVFs and their usefulness as auxiliary tasks in a series of experiments on the four-room domain.
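
To make the formulation concrete, here is a minimal sketch of an adversarial value function computed by brute force on a toy MDP. It is not the paper's code: the random MDP, the function names, and the exhaustive search over deterministic policies are illustrative assumptions. For a sampled direction delta over states, it computes each value function V^pi = (I - gamma P^pi)^{-1} r^pi and returns the one maximizing delta . V^pi.

```python
import numpy as np
from itertools import product

def value_function(P, r, policy, gamma=0.9):
    """V^pi = (I - gamma P^pi)^{-1} r^pi for a deterministic policy."""
    S = P.shape[1]
    P_pi = P[policy, np.arange(S)]   # (S, S): row s is P(. | s, policy[s])
    r_pi = r[policy, np.arange(S)]   # (S,): reward in state s under policy[s]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def adversarial_value_function(P, r, delta, gamma=0.9):
    """Brute-force an AVF for direction delta: the V^pi maximizing
    <delta, V^pi> over deterministic policies (tractable only for tiny MDPs)."""
    A, S = r.shape
    best_v, best_score = None, -np.inf
    for policy in product(range(A), repeat=S):
        v = value_function(P, r, np.array(policy), gamma)
        if delta @ v > best_score:
            best_score, best_v = delta @ v, v
    return best_v

# Toy MDP: 4 states, 2 actions, random transitions and rewards.
rng = np.random.default_rng(0)
A, S = 2, 4
P = rng.dirichlet(np.ones(S), size=(A, S))  # P[a, s] is a distribution over next states
r = rng.standard_normal((A, S))
delta = rng.standard_normal(S)              # "interest" direction over states
print(adversarial_value_function(P, r, delta))
```

Restricting the search to deterministic policies is enough here because a linear functional over the value-function polytope attains its maximum at an extremal point; in large state spaces this enumeration is intractable, which is why AVFs serve as prediction targets (auxiliary tasks) rather than objects to enumerate.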

[1] Richard L. Lewis, et al. Discovery of Useful Questions as Auxiliary Tasks, 2019, NeurIPS.

[2] Richard S. Sutton, et al. TD Models: Modeling the World at a Mixture of Time Scales, 1995, ICML.

[3] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[4] Tom Schaul, et al. Successor Features for Transfer in Reinforcement Learning, 2016, NIPS.

[5] Rémi Munos, et al. Error Bounds for Approximate Policy Iteration, 2003, ICML.

[6] Matteo Hessel, et al. Deep Reinforcement Learning and the Deadly Triad, 2018, arXiv.

[7] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, MIT Press.

[8] Patrick M. Pilarski, et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction, 2011, AAMAS.

[9] Dimitri P. Bertsekas, et al. Dynamic Programming and Optimal Control, Vol. II, 1976.

[10] Sean R. Eddy, et al. What is dynamic programming?, 2004, Nature Biotechnology.

[11] Richard S. Sutton, et al. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding, 1995, NIPS.

[12] Marc G. Bellemare, et al. DeepMDP: Learning Continuous Latent Space Models for Representation Learning, 2019, ICML.

[13] Marc G. Bellemare, et al. Dopamine: A Research Framework for Deep Reinforcement Learning, 2018, arXiv.

[14] Vladlen Koltun, et al. Learning to Act by Predicting the Future, 2016, ICLR.

[15] Marlos C. Machado, et al. State of the Art Control of Atari Games Using Shallow Reinforcement Learning, 2015, AAMAS.

[16] Dimitri P. Bertsekas, et al. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[17] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract), 2012, IJCAI.

[18] Arthur L. Samuel, et al. Some Studies in Machine Learning Using the Game of Checkers, 1967, IBM J. Res. Dev.

[19] Oriol Vinyals, et al. Representation Learning with Contrastive Predictive Coding, 2018, arXiv.

[20] Marlos C. Machado, et al. Eigenoption Discovery through the Deep Successor Representation, 2017, ICLR.

[21] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[22] Shie Mannor, et al. Regularized Policy Iteration with Nonparametric Function Spaces, 2016, J. Mach. Learn. Res.

[23] Marek Petrik, et al. An Analysis of Laplacian Methods for Value Function Approximation in MDPs, 2007, IJCAI.

[24] Marek Petrik, et al. Robust Approximate Bilinear Programming for Value Function Approximation, 2011, J. Mach. Learn. Res.

[25] Nicolas Le Roux, et al. The Value Function Polytope in Reinforcement Learning, 2019, ICML.

[26] Shie Mannor, et al. Basis Function Adaptation in Temporal Difference Reinforcement Learning, 2005, Ann. Oper. Res.

[27] Stephen P. Boyd, et al. Convex Optimization, 2004, Cambridge University Press.

[28] R. Tyrrell Rockafellar, et al. Convex Analysis, 1970, Princeton Landmarks in Mathematics and Physics.

[29] Samuel Gershman, et al. Design Principles of the Hippocampal Cognitive Map, 2014, NIPS.

[30] Gerald Tesauro, et al. Temporal difference learning and TD-Gammon, 1995, CACM.

[31] Dimitri P. Bertsekas, et al. Feature-based aggregation and deep reinforcement learning: a survey and some new implementations, 2018, IEEE/CAA Journal of Automatica Sinica.

[32] Peter Dayan, et al. Improving Generalization for Temporal Difference Learning: The Successor Representation, 1993, Neural Computation.

[33] Lawrence Carin, et al. Linear Feature Encoding for Reinforcement Learning, 2016, NIPS.

[34] Rémi Munos, et al. Performance Bounds in Lp-norm for Approximate Value Iteration, 2007, SIAM J. Control Optim.

[35] Yifan Wu, et al. The Laplacian in RL: Learning Representations with Efficient Approximations, 2018, ICLR.

[36] Marcello Restelli, et al. Boosted Fitted Q-Iteration, 2017, ICML.

[37] Pierre Geurts, et al. Tree-Based Batch Mode Reinforcement Learning, 2005, J. Mach. Learn. Res.

[38] Mahesan Niranjan, et al. On-line Q-learning using connectionist systems, 1994.

[39] Peter Dayan, et al. Structure in the Space of Value Functions, 2002, Machine Learning.

[40] Thomas J. Walsh, et al. Towards a Unified Theory of State Abstraction for MDPs, 2006, AI&M.

[41] Marc G. Bellemare, et al. An Atari Model Zoo for Analyzing, Visualizing, and Comparing Deep Reinforcement Learning Agents, 2018, IJCAI.

[42] Shie Mannor, et al. Automatic basis function construction for approximate dynamic programming and reinforcement learning, 2006, ICML.

[43] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[44] Bastian Goldlücke, et al. Variational Analysis, 2014, Computer Vision: A Reference Guide.

[45] Alec Solway, et al. Optimal Behavioral Hierarchy, 2014, PLoS Comput. Biol.

[46] Marcus Hutter, et al. Feature Reinforcement Learning: Part I. Unstructured MDPs, 2009, J. Artif. Gen. Intell.

[47] Sergey Levine, et al. Near-Optimal Representation Learning for Hierarchical Reinforcement Learning, 2018, ICLR.

[48] Richard S. Sutton, et al. Predictive Representations of State, 2001, NIPS.

[49] Lihong Li, et al. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning, 2008, ICML.

[50] Doina Precup, et al. Sparse Distributed Memories for On-Line Value-Based Reinforcement Learning, 2004, ECML.

[51] Vivek S. Borkar, et al. Feature Search in the Grassmanian in Online Reinforcement Learning, 2013, IEEE Journal of Selected Topics in Signal Processing.

[52] Doina Precup, et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999, Artif. Intell.

[53] Michail G. Lagoudakis, et al. Least-Squares Policy Iteration, 2003, J. Mach. Learn. Res.

[54] Lihong Li, et al. Analyzing feature generation for value-function approximation, 2007, ICML.

[55] Bo Liu, et al. Basis Construction from Power Series Expansions of Value Functions, 2010, NIPS.

[56] O. G. Selfridge, et al. Pandemonium: a paradigm for learning, 1988.

[57] Joelle Pineau, et al. Combined Reinforcement Learning via Abstract Representations, 2018, AAAI.

[58] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[59] Bahram Behzadian, et al. Feature Selection by Singular Value Decomposition for Reinforcement Learning, 2019.

[60] Tom Schaul, et al. Universal Value Function Approximators, 2015, ICML.

[61] Marc G. Bellemare, et al. A Distributional Perspective on Reinforcement Learning, 2017, ICML.

[62] Jens Vygen, et al. The Book Review Column, 2020, SIGACT News.

[63] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[64] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.

[65] Marlos C. Machado, et al. A Laplacian Framework for Option Discovery in Reinforcement Learning, 2017, ICML.

[66] Sridhar Mahadevan, et al. Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes, 2007, J. Mach. Learn. Res.

[67] Dimitri P. Bertsekas, et al. Basis function adaptation methods for cost approximation in MDP, 2009, IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.

[68] Shie Mannor, et al. Shallow Updates for Deep Reinforcement Learning, 2017, NIPS.

[69] Michael L. Littman, et al. Near Optimal Behavior via Approximate State Abstraction, 2016, ICML.

[70] Martha White, et al. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning, 2015, J. Mach. Learn. Res.

[71] Tom Schaul, et al. Reinforcement Learning with Unsupervised Auxiliary Tasks, 2016, ICLR.

[72] Doina Precup, et al. Representation Discovery for MDPs Using Bisimulation Metrics, 2015, AAAI.

[73] Marcin Andrychowicz, et al. Hindsight Experience Replay, 2017, NIPS.

[74] Martha White, et al. Two-Timescale Networks for Nonlinear Value Function Approximation, 2019, ICLR.