Parameter-based Value Functions

Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-based Value Functions (PVFs) whose inputs include the policy parameters. This allows them to generalize across different policies. PVFs can evaluate the performance of any policy given a state, a state-action pair, or a distribution over the RL agent's initial states. First, we show how PVFs yield novel off-policy policy gradient theorems. Then, we derive off-policy actor-critic algorithms based on PVFs trained by Monte Carlo or Temporal Difference methods. We show how learned PVFs can zero-shot learn new policies that outperform any policy seen during training. Finally, our algorithms are evaluated on a selection of discrete and continuous control tasks using both shallow policies and deep neural networks. Their performance is comparable to that of state-of-the-art methods.
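To make the core idea concrete, below is a minimal sketch of a parameter-based start-state value function: a network V(theta) that maps a policy's flattened parameters to an estimate of its expected return, fitted with Monte Carlo targets from randomly sampled policies and then used for zero-shot policy improvement by gradient ascent on theta. The toy environment, linear policy parameterization, network sizes, and hyperparameters are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a parameter-based (start-state) value function, assuming a
# toy 1-D environment and a 2-parameter linear policy. Not the paper's exact
# architecture or training setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
HORIZON = 10

def rollout_return(theta):
    """Monte Carlo return of the linear policy a = w*s + b in a toy task."""
    w, b = float(theta[0]), float(theta[1])
    s, ret = 1.0, 0.0
    for _ in range(HORIZON):
        a = w * s + b      # deterministic linear policy
        s = s + a          # simple integrator dynamics
        ret += -(s ** 2)   # reward: stay close to the origin
    return ret

# PVF: maps policy parameters theta (2-dimensional here) to an estimated return.
pvf = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
opt_v = torch.optim.Adam(pvf.parameters(), lr=1e-2)

# Fit V(theta) on Monte Carlo returns of randomly sampled policies.
thetas = torch.empty(512, 2).uniform_(-1.5, 0.5)
returns = torch.tensor([[rollout_return(t)] for t in thetas])
for _ in range(2000):
    opt_v.zero_grad()
    loss = nn.functional.mse_loss(pvf(thetas), returns)
    loss.backward()
    opt_v.step()

# Zero-shot policy improvement: gradient ascent on the learned V with respect
# to theta, without any further environment interaction.
theta = torch.zeros(2, requires_grad=True)
opt_pi = torch.optim.Adam([theta], lr=5e-2)
for _ in range(200):
    opt_pi.zero_grad()
    (-pvf(theta).sum()).backward()  # maximize the predicted return
    opt_pi.step()

print("improved theta:", theta.detach().numpy())
print("true return of improved policy:", rollout_return(theta.detach()))
```

In this sketch the gradient ascent step drives theta toward a well-performing policy purely through the learned value function, which mirrors the zero-shot learning of new policies described in the abstract; the paper's actual algorithms additionally cover state- and state-action-conditioned PVFs and Temporal Difference training.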
