Generalized exploration in policy search

To learn control policies in unknown environments, learning agents need to explore by trying out actions that currently appear suboptimal. In prior work, such exploration is performed either by perturbing the actions at each time step independently, or by perturbing the policy parameters once for an entire episode. Each strategy has its own advantages: independent per-step perturbations provide many distinct exploratory samples, whereas episode-wide parameter perturbations yield behavior that is consistent over the course of a rollout. A more balanced trade-off between these extremes could therefore be beneficial. We introduce a unifying view of step-based and episode-based exploration that allows for such balanced trade-offs, and the resulting strategy can be combined with various reinforcement learning algorithms. In this paper, we study this generalized exploration strategy in a policy gradient method and in relative entropy policy search. We evaluate the strategy on four dynamical systems and compare the results to those of the established step-based and episode-based exploration strategies. Our results show that a more balanced trade-off can yield faster learning and better final policies, and they illustrate some of the effects that cause these performance differences.
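
The sketch below illustrates one way such a trade-off between step-based and episode-based exploration could be realized: perturbing the policy parameters with temporally correlated noise. This is a minimal example under assumptions of our own, not necessarily the paper's exact formulation; in particular, the correlation coefficient beta and the function name are chosen here purely for illustration. With beta = 0 the perturbation is redrawn independently at every time step (step-based exploration), with beta = 1 a single perturbation is held fixed for the whole episode (episode-based exploration), and intermediate values interpolate between the two.

import numpy as np

def sample_perturbation_sequence(theta, sigma, beta, horizon, rng=None):
    """Sample a sequence of perturbed policy parameters for one episode.

    beta = 0.0 -> independent perturbation at every step (step-based),
    beta = 1.0 -> one perturbation held for the whole episode (episode-based),
    0 < beta < 1 -> temporally correlated noise interpolating between the two.
    """
    rng = np.random.default_rng() if rng is None else rng
    dim = theta.shape[0]
    # Initial perturbation drawn from the stationary distribution N(0, sigma^2 I).
    eps = sigma * rng.standard_normal(dim)
    perturbed = []
    for _ in range(horizon):
        perturbed.append(theta + eps)
        # First-order autoregressive update; the innovation is scaled so the
        # stationary variance stays sigma^2 regardless of beta.
        eps = beta * eps + np.sqrt(1.0 - beta**2) * sigma * rng.standard_normal(dim)
    return np.stack(perturbed)

# Example: a 5-dimensional linear policy explored over a 200-step episode.
theta = np.zeros(5)
thetas_step = sample_perturbation_sequence(theta, sigma=0.1, beta=0.0, horizon=200)
thetas_epis = sample_perturbation_sequence(theta, sigma=0.1, beta=1.0, horizon=200)
thetas_mix = sample_perturbation_sequence(theta, sigma=0.1, beta=0.9, horizon=200)

Scaling the innovation term by sqrt(1 - beta^2) keeps the marginal variance of the perturbation equal to sigma^2 for every choice of beta, so varying the correlation alone does not change the overall exploration magnitude.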
