[1] Ronald J. Williams,et al. Gradient-based learning algorithms for recurrent networks and their computational complexity , 1995 .
[2] Hamid Reza Maei,et al. Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation , 2018, ArXiv.
[3] Jürgen Schmidhuber,et al. Evolino: Hybrid Neuroevolution / Optimal Linear Search for Sequence Prediction , 2005, IJCAI 2005.
[4] Paul J. Werbos,et al. Backpropagation Through Time: What It Does and How to Do It , 1990, Proc. IEEE.
[5] Benjamin Recht,et al. Simple random search of static linear policies is competitive for reinforcement learning , 2018, NeurIPS.
[6] Yishay Mansour,et al. Learning Bounds for Importance Weighting , 2010, NIPS.
[7] Martha White,et al. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning , 2015, J. Mach. Learn. Res..
[8] John N. Tsitsiklis,et al. Actor-Critic Algorithms , 1999, NIPS.
[9] R. Sutton,et al. Gradient temporal-difference learning algorithms , 2011 .
[10] J. Schmidhuber. Making the World Differentiable: On Using Self-Supervised Fully Recurrent Neural Networks for Dynamic Reinforcement Learning and Planning in Non-Stationary Environments , 2018 .
[11] Shalabh Bhatnagar,et al. Toward Off-Policy Learning Control with Function Approximation , 2010, ICML.
[12] Sepp Hochreiter,et al. Untersuchungen zu dynamischen neuronalen Netzen , 1991 .
[13] Richard S. Sutton,et al. Learning to predict by the methods of temporal differences , 1988, Machine Learning.
[14] Jürgen Schmidhuber,et al. Networks adjusting networks , 1990 .
[15] Shalabh Bhatnagar,et al. Incremental Natural Actor-Critic Algorithms , 2007, NIPS.
[16] Prabhat,et al. Scalable Bayesian Optimization Using Deep Neural Networks , 2015, ICML.
[17] Marcello Restelli,et al. Policy Optimization via Importance Sampling , 2018, NeurIPS.
[18] Paul J. Werbos,et al. Generalization of backpropagation with application to a recurrent gas market model , 1988, Neural Networks.
[19] Jürgen Schmidhuber,et al. Training Recurrent Networks by Evolino , 2007, Neural Computation.
[20] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.
[21] Shalabh Bhatnagar,et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation , 2009, ICML '09.
[22] Jürgen Schmidhuber,et al. A ‘Self-Referential’ Weight Matrix , 1993 .
[23] Henry Markram,et al. Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations , 2002, Neural Computation.
[24] Tom Schaul,et al. Natural Evolution Strategies , 2008, 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence).
[25] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.
[26] Patrick M. Pilarski,et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction , 2011, AAMAS.
[27] Martin L. Puterman,et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .
[28] Geoffrey E. Hinton,et al. Layer Normalization , 2016, ArXiv.
[29] Frank Sehnke,et al. Policy Gradients with Parameter-Based Exploration for Control , 2008, ICANN.
[30] Shalabh Bhatnagar,et al. Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation , 2009, NIPS.
[31] R. L. Stratonovich. Conditional Markov Processes , 1960 .
[32] Jürgen Schmidhuber,et al. A robot that reinforcement-learns to identify and memorize important previous observations , 2003, Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No.03CH37453).
[33] Gerald Tesauro,et al. Temporal Difference Learning and TD-Gammon , 1995, J. Int. Comput. Games Assoc..
[34] Tom Schaul,et al. Policy Evaluation Networks , 2020, ArXiv.
[35] Jürgen Schmidhuber,et al. Learning to Generate Artificial Fovea Trajectories for Target Detection , 1991, Int. J. Neural Syst..
[36] Jürgen Schmidhuber,et al. Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks , 1992, Neural Computation.
[37] Xi Chen,et al. Evolution Strategies as a Scalable Alternative to Reinforcement Learning , 2017, ArXiv.
[38] Shane Legg,et al. Human-level control through deep reinforcement learning , 2015, Nature.
[39] Jürgen Schmidhuber,et al. Deep learning in neural networks: An overview , 2014, Neural Networks.
[40] C. Malsburg. Self-organization of orientation sensitive cells in the striate cortex , 2004, Kybernetik.
[41] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[42] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[43] Tom Schaul,et al. Universal Value Function Approximators , 2015, ICML.
[44] A. Rollett,et al. The Monte Carlo Method , 2004 .
[45] Yuval Tassa,et al. Continuous control with deep reinforcement learning , 2015, ICLR.
[46] Martha White,et al. Linear Off-Policy Actor-Critic , 2012, ICML.
[47] Sanjoy Dasgupta,et al. Off-Policy Temporal Difference Learning with Function Approximation , 2001, ICML.
[48] Sergey Levine,et al. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor , 2018, ICML.
[49] Richard S. Sutton,et al. A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation , 2008, NIPS.
[50] Herke van Hoof,et al. Addressing Function Approximation Error in Actor-Critic Methods , 2018, ICML.
[51] Richard S. Sutton,et al. Temporal credit assignment in reinforcement learning , 1984 .
[52] Jasper Snoek,et al. Practical Bayesian Optimization of Machine Learning Algorithms , 2012, NIPS.
[53] Daniel Keysers,et al. Predicting Neural Network Accuracy from Weights , 2020, ArXiv.
[54] Guy Lever,et al. Deterministic Policy Gradient Algorithms , 2014, ICML.
[55] Jürgen Schmidhuber,et al. Modeling systems with internal state using evolino , 2005, GECCO '05.
[56] Jürgen Schmidhuber,et al. Recurrent policy gradients , 2010, Log. J. IGPL.
[57] Leemon C. Baird,et al. Residual Algorithms: Reinforcement Learning with Function Approximation , 1995, ICML.
[58] G. Box,et al. On the Experimental Attainment of Optimum Conditions , 1951 .
[59] Stefan Schaal,et al. 2008 Special Issue: Reinforcement learning of motor skills with policy gradients , 2008 .
[60] John E. Dennis,et al. Optimization Using Surrogate Objectives on a Helicopter Test Example , 1998 .
[61] Herbert Jaeger,et al. The 'echo state' approach to analysing and training recurrent neural networks , 2001 .
[62] Donald R. Jones,et al. A Taxonomy of Global Optimization Methods Based on Response Surfaces , 2001, J. Glob. Optim..
[63] Alex Graves,et al. Decoupled Neural Interfaces using Synthetic Gradients , 2016, ICML.
[64] Frank Sehnke,et al. Parameter-exploring policy gradients , 2010, Neural Networks.
[65] Andrew W. Moore,et al. Memory-based Stochastic Optimization , 1995, NIPS.