Maximum Entropy Reinforcement Learning with Mixture Policies

Abstract: Mixture models are an expressive hypothesis class that can approximate a rich set of policies. However, using mixture policies in the Maximum Entropy (MaxEnt) framework is not straightforward: the entropy of a mixture is not the weighted sum of its component entropies, nor does it have a closed-form expression in most cases. Using such policies in MaxEnt algorithms therefore requires a tractable approximation of the mixture entropy. In this paper, we derive a simple, low-variance mixture-entropy estimator and show that it is closely related to the sum of marginal entropies. Equipped with our entropy estimator, we derive a mixture-policy variant of Soft Actor-Critic (SAC) and evaluate it on a series of continuous control tasks.
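
To make the entropy problem concrete: since a mixture's entropy has no closed form, a natural baseline is a Monte Carlo estimate of H(p) = -E_{x~p}[log p(x)], bracketed by the classical information-theoretic bounds in which the weighted sum of component entropies is a lower bound and adding the weight entropy H(w) gives an upper bound. The sketch below illustrates both for a hypothetical 1-D Gaussian mixture (parameters chosen for illustration); it is a minimal NumPy/SciPy example, not the estimator derived in the paper.

```python
# Minimal sketch: Monte Carlo mixture-entropy estimate vs. the classical bounds
#   sum_i w_i H(p_i)  <=  H(p)  <=  sum_i w_i H(p_i) + H(w).
# The mixture parameters below are hypothetical, for illustration only.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Hypothetical 1-D Gaussian mixture: weights, means, standard deviations.
w = np.array([0.5, 0.3, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([1.0, 0.5, 2.0])

def mixture_logpdf(x):
    # log p(x) = log sum_i w_i N(x; mu_i, sigma_i), computed stably.
    comp = np.log(w) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
    return logsumexp(comp, axis=1)

# Monte Carlo estimate of H(p) = -E_{x~p}[log p(x)]:
# sample a component index, then sample from that component.
n = 100_000
idx = rng.choice(len(w), size=n, p=w)
x = rng.normal(mu[idx], sigma[idx])
h_mc = -mixture_logpdf(x).mean()

# Closed-form differential entropy of each Gaussian component.
h_comp = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
lower = np.dot(w, h_comp)             # weighted sum of marginal entropies
upper = lower - np.dot(w, np.log(w))  # ... plus the weight entropy H(w)

print(f"MC estimate: {h_mc:.4f}, bounds: [{lower:.4f}, {upper:.4f}]")
```

In a SAC-style update, the same -log p(x) term would typically be evaluated on a reparameterized sample so that gradients can flow through the entropy bonus; the sketch above only estimates the scalar value.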
