Maximum Entropy Reinforcement Learning with Mixture Policies

Abstract: Mixture models are an expressive hypothesis class that can approximate a rich set of policies. However, using mixture policies in the Maximum Entropy (MaxEnt) framework is not straightforward: the entropy of a mixture is not the weighted sum of its component entropies, nor does it have a closed-form expression in most cases. Using such policies in MaxEnt algorithms therefore requires a tractable approximation of the mixture entropy. In this paper, we derive a simple, low-variance mixture-entropy estimator and show that it is closely related to the sum of marginal entropies. Equipped with our entropy estimator, we derive a mixture-policy variant of Soft Actor-Critic (SAC) and evaluate it on a series of continuous control tasks.
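
To make the entropy problem concrete: since a mixture's entropy has no closed form, a natural baseline is a Monte Carlo estimate of H(p) = -E_{x~p}[log p(x)], bracketed by the classical information-theoretic bounds in which the weighted sum of component entropies is a lower bound and adding the weight entropy H(w) gives an upper bound. The sketch below illustrates both for a hypothetical 1-D Gaussian mixture (parameters chosen for illustration); it is a minimal NumPy/SciPy example, not the estimator derived in the paper.

```python
# Minimal sketch: Monte Carlo mixture-entropy estimate vs. the classical bounds
#   sum_i w_i H(p_i)  <=  H(p)  <=  sum_i w_i H(p_i) + H(w).
# The mixture parameters below are hypothetical, for illustration only.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Hypothetical 1-D Gaussian mixture: weights, means, standard deviations.
w = np.array([0.5, 0.3, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([1.0, 0.5, 2.0])

def mixture_logpdf(x):
    # log p(x) = log sum_i w_i N(x; mu_i, sigma_i), computed stably.
    comp = np.log(w) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    return logsumexp(comp, axis=1)

# Monte Carlo estimate of H(p) = -E_{x~p}[log p(x)]:
# sample a component index, then sample from that component.
n = 100_000
idx = rng.choice(len(w), size=n, p=w)
x = rng.normal(mu[idx], sigma[idx])
h_mc = -mixture_logpdf(x).mean()

# Closed-form differential entropy of each Gaussian component.
h_comp = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
lower = np.dot(w, h_comp)             # weighted sum of marginal entropies
upper = lower - np.dot(w, np.log(w))  # ... plus the weight entropy H(w)

print(f"MC estimate: {h_mc:.4f}, bounds: [{lower:.4f}, {upper:.4f}]")
```

In a SAC-style update, the same -log p(x) term would typically be evaluated on a reparameterized sample so that gradients can flow through the entropy bonus; the sketch above only estimates the scalar value.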
