Learning Relative Return Policies With Upside-Down Reinforcement Learning

Lately, there has been a resurgence of interest in using supervised learning to solve reinforcement learning problems. Recent work in this area has largely focused on learning command-conditioned policies. We investigate the potential of one such method—upside-down reinforcement learning—to work with commands that specify a desired relationship between some scalar value and the observed return. We show that upside-down reinforcement learning can learn to carry out such commands online in a tabular bandit setting and in CartPole with non-linear function approximation. By doing so, we demonstrate the power of this family of methods and open the way for their practical use under more complicated command structures.
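As a concrete illustration of the kind of command the abstract describes, the sketch below trains a tabular command-conditioned policy online on a two-armed bandit: a command asks for a return that is greater than (or less than) a scalar threshold, and the action taken is reinforced only when the observed return satisfies that command. The bandit means, the binned command encoding, and the count-based update are illustrative assumptions for this sketch, not the paper's method, which uses upside-down reinforcement learning and, for CartPole, non-linear function approximation.

```python
# Minimal sketch of relative-return commands on a two-armed bandit.
# All names and the command encoding are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Two arms with different mean rewards (assumed values).
ARM_MEANS = np.array([0.2, 0.8])

def pull(arm):
    """Sample a noisy reward (treated as the episodic return) for an arm."""
    return ARM_MEANS[arm] + 0.1 * rng.standard_normal()

# Tabular command-conditioned policy: for each (relation, threshold-bin)
# command we keep per-arm counts of how often the observed return
# satisfied the command, and act greedily with respect to those counts.
N_BINS = 10                       # discretise the scalar threshold
counts = np.ones((2, N_BINS, 2))  # (relation, bin, arm) pseudo-counts

def bin_of(c):
    return int(np.clip(c * N_BINS, 0, N_BINS - 1))

def act(relation, c, eps=0.1):
    """relation: 0 -> 'return > c', 1 -> 'return < c'."""
    if rng.random() < eps:        # epsilon-greedy exploration
        return int(rng.integers(2))
    return int(np.argmax(counts[relation, bin_of(c)]))

# Online loop: sample a command, act, then reinforce the taken action
# only if the observed return obeyed the commanded relation.
for step in range(5000):
    relation = int(rng.integers(2))
    c = rng.random()              # scalar threshold in [0, 1)
    arm = act(relation, c)
    ret = pull(arm)
    satisfied = (ret > c) if relation == 0 else (ret < c)
    counts[relation, bin_of(c), arm] += float(satisfied)

# The learned policy should pick the high-mean arm for 'return > 0.5'
# and the low-mean arm for 'return < 0.5'.
print("command 'return > 0.5' ->", act(0, 0.5, eps=0.0))
print("command 'return < 0.5' ->", act(1, 0.5, eps=0.0))
```

In this simplified tabular stand-in, the greedy choice over satisfaction counts plays the role that supervised training of a command-conditioned policy plays in the full method.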
