Entropic Risk Measure in Policy Search

With the increasing pace of automation, modern robotic systems need to act in stochastic, non-stationary, partially observable environments. A range of algorithms for finding parameterized policies that optimize long-term average performance have been proposed in the past. However, the majority of these approaches do not explicitly take into account the variability of the performance metric, which may lead to policies that, although performing well on average, can perform spectacularly badly in a particular run or over a period of time. To address this shortcoming, we study an approach to policy optimization that explicitly takes into account higher-order statistics of the reward function. In this paper, we extend policy gradient methods to include the entropic risk measure in the objective function and evaluate their performance in simulation experiments and on a real-robot task of learning a hitting motion in robot badminton.
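
For reference, the entropic risk measure of a stochastic return R with risk-sensitivity parameter \beta is commonly written as follows (a standard textbook definition, not notation taken from this paper):

    \rho_\beta(R) \;=\; \frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{\beta R}\right]

Here \beta > 0 emphasizes favourable outcomes (risk seeking) and \beta < 0 penalizes variability (risk aversion); a Taylor expansion around \beta = 0 gives \mathbb{E}[R] + \frac{\beta}{2}\,\mathrm{Var}[R] + O(\beta^2), which is the usual sense in which this objective captures higher-order statistics of the reward rather than its mean alone.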
