Meta-Learning of Exploration and Exploitation Parameters with Replacing Eligibility Traces

When developing autonomous learning agents, performance depends crucially on selecting reasonable learning parameters, such as learning rates or exploration parameters. In this work we investigate meta-learning of exploration parameters using the “REINFORCE exploration control” (REC) framework, and we combine REC with replacing eligibility traces, a basic mechanism for tackling the problem of delayed rewards in reinforcement learning. We show empirically, for a robot example and the mountain-car problem with two goals, how the proposed combination can improve learning performance. Furthermore, we observe that setting the time constant \(\lambda\) is not straightforward, because it is intimately interrelated with the learning rate \(\alpha\).

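For context, the replacing-eligibility-trace mechanism referred to above can be illustrated with tabular Sarsa(\(\lambda\)). The sketch below is only a minimal illustration of that mechanism, not the paper's REC method: the environment interface (Gym-style reset/step), the epsilon_greedy helper, and all hyperparameter values are assumptions made for the example.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, n_actions):
    """Pick a random action with probability epsilon, else the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_lambda_replacing(env, n_states, n_actions,
                           alpha=0.1, gamma=0.99, lam=0.9,
                           epsilon=0.1, episodes=500):
    """Tabular Sarsa(lambda) with replacing eligibility traces.

    `env` is assumed to expose reset() -> state and
    step(a) -> (state, reward, done, info) with integer states.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)                 # eligibility traces, reset each episode
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, n_actions)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = epsilon_greedy(Q, s2, epsilon, n_actions)
            # TD error; the bootstrap term is dropped at episode end
            delta = r + gamma * Q[s2, a2] * (not done) - Q[s, a]
            # Replacing traces: clear the other actions of the visited state
            # and set the visited pair to 1 (instead of accumulating e += 1)
            e[s, :] = 0.0
            e[s, a] = 1.0
            # Propagate the TD error along all traces, then decay them
            Q += alpha * delta * e
            e *= gamma * lam
            s, a = s2, a2
    return Q
```

The key difference from accumulating traces is the assignment `e[s, a] = 1.0`: revisiting a state-action pair resets its trace rather than letting it grow, which is what makes the credit assignment for delayed rewards more robust when states are revisited often.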