Similarities and differences between policy gradient methods and evolution strategies

Natural policy gradient methods and the covariance matrix adaptation evolution strategy, two variable metric methods proposed for solving reinforcement learning tasks, are contrasted to point out their conceptual similarities and differences. Experiments on the cart pole benchmark are conducted as a first attempt to compare their performance.

Reinforcement learning (RL) algorithms search for a policy mapping states of the environment to (a probability distribution over) the actions an agent can take in those states. The goal is to find a behavior such that some notion of future reward is maximized. Direct policy search methods address this task by directly learning the parameters of a function that explicitly represents the policy. Here we consider two general approaches to direct policy search, namely policy gradient methods (PGMs) and evolution strategies (ESs). We will argue that these approaches are quite similar. This makes it all the more surprising that so far there has been no systematic comparison of PGMs and ESs applied to the same test problems and operating on the same class of policies with the same parameterization. This paper is our attempt to draw such a comparison, both on a conceptual level and through first empirical studies.

We restrict our consideration to the natural actor-critic algorithm (NAC, (1, 2)) and the covariance matrix adaptation ES (CMA-ES, (3)), which are compared in the context of optimization in (4). We picked these two because they can be considered state-of-the-art, they are our favorite direct policy search method and evolutionary RL algorithm, respectively, and they are both variable metric methods. In Section 2 we briefly review the NAC algorithm and the CMA-ES. Section 3 describes the conceptual relations between these two approaches, and in Section 4 we use a simple toy problem to compare the methods empirically.
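To make the contrast between the two families concrete, the following minimal Python sketch (not part of the original study) pits a plain likelihood-ratio policy gradient update against a simple isotropic (mu, lambda) evolution strategy on a hand-written cart-pole simulator. It deliberately omits the defining ingredients of NAC (the Fisher-metric correction and the compatible critic) and of CMA-ES (covariance matrix and step-size adaptation); the linear Gaussian policy, the dynamics constants, and all hyperparameters are illustrative assumptions, not the experimental setup of the paper.

```python
import numpy as np

# Minimal cart-pole dynamics (Euler integration). Constants follow the common
# benchmark formulation; this simulator is illustrative, not the paper's setup.
def cartpole_step(state, force, dt=0.02):
    g, m_c, m_p, l = 9.8, 1.0, 0.1, 0.5
    x, x_dot, th, th_dot = state
    sin_th, cos_th = np.sin(th), np.cos(th)
    tmp = (force + m_p * l * th_dot**2 * sin_th) / (m_c + m_p)
    th_acc = (g * sin_th - cos_th * tmp) / (
        l * (4.0 / 3.0 - m_p * cos_th**2 / (m_c + m_p)))
    x_acc = tmp - m_p * l * th_acc * cos_th / (m_c + m_p)
    return np.array([x + dt * x_dot, x_dot + dt * x_acc,
                     th + dt * th_dot, th_dot + dt * th_acc])

def rollout(theta, horizon=200, sigma_a=0.1, rng=None):
    """Return episode reward and the score-function gradient for a
    linear Gaussian policy a ~ N(theta . s, sigma_a^2)."""
    rng = rng or np.random.default_rng()
    s = np.array([0.0, 0.0, 0.05, 0.0])
    ret, grad_logp = 0.0, np.zeros_like(theta)
    for _ in range(horizon):
        mean = theta @ s
        a = mean + sigma_a * rng.standard_normal()
        grad_logp += (a - mean) / sigma_a**2 * s   # d/dtheta log pi(a|s)
        s = cartpole_step(s, 10.0 * np.tanh(a))
        if abs(s[0]) > 2.4 or abs(s[2]) > 0.2:     # pole fell or cart left track
            break
        ret += 1.0
    return ret, grad_logp

rng = np.random.default_rng(0)

# (a) Plain likelihood-ratio policy gradient ascent (REINFORCE with a
# moving-average baseline) -- a stand-in for the gradient-based family.
theta_pg, baseline = np.zeros(4), 0.0
for _ in range(200):
    ret, grad_logp = rollout(theta_pg, rng=rng)
    baseline = 0.9 * baseline + 0.1 * ret
    theta_pg += 0.02 * (ret - baseline) * grad_logp

# (b) A simple isotropic (mu, lambda) evolution strategy -- a stand-in for
# CMA-ES without covariance adaptation or step-size control.
theta_es, sigma = np.zeros(4), 0.5
for _ in range(50):
    pop = [theta_es + sigma * rng.standard_normal(4) for _ in range(10)]
    fitness = [rollout(p, rng=rng)[0] for p in pop]
    elite = np.argsort(fitness)[-3:]               # keep the best mu = 3
    theta_es = np.mean([pop[i] for i in elite], axis=0)

print("PG return:", rollout(theta_pg, rng=rng)[0])
print("ES return:", rollout(theta_es, rng=rng)[0])
```

Both loops search the same four-dimensional parameter space of the same policy class; they differ only in how the search distribution over parameters is moved, which is exactly the axis along which NAC and CMA-ES are compared in the remainder of the paper.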

[1] Yishay Mansour et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS, 1999.

[2] Petros Koumoutsakos et al. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 2003.

[3] Stefan Schaal et al. Reinforcement Learning for Humanoid Robotics. 2003.

[4] Jeff G. Schneider et al. Covariant Policy Search. IJCAI, 2003.

[5] Gregor Schöner et al. Making Driver Modeling Attractive. IEEE Intelligent Systems, 2005.

[6] Nikolaus Hansen et al. The CMA Evolution Strategy: A Comparing Review. Towards a New Evolutionary Computation, 2006.

[7] Christian Igel et al. Reinforcement Learning in a Nutshell. ESANN, 2007.

[8] Stefan Schaal et al. Applying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning. ESANN, 2007.

[9] Martin A. Riedmiller et al. Evaluation of Policy Gradient Methods and Variants on the Cart-Pole Benchmark. IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007.

[10] Tom Schaul et al. Natural Evolution Strategies. IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), 2008.