Optimistic Policy Optimization via Multiple Importance Sampling

Policy Search (PS) is an effective approach to Reinforcement Learning (RL) for solving control tasks with continuous state-action spaces. In this paper, we address the exploration-exploitation trade-off in PS by proposing an approach based on Optimism in the Face of Uncertainty. We cast the PS problem as a suitable Multi-Armed Bandit (MAB) problem, defined over the policy parameter space, and we propose a class of algorithms that effectively exploit the problem structure by leveraging Multiple Importance Sampling to perform an off-policy estimation of the expected return. We show that the regret of the proposed approach is bounded by Õ(√T) for both discrete and continuous parameter spaces. Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms.
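
As a rough, illustrative sketch (not the authors' implementation), the following Python snippet mimics the idea described above on a toy problem: a discrete grid of hyperpolicy means plays the role of the bandit arms, every candidate is evaluated off-policy from all past samples with a balance-heuristic Multiple Importance Sampling estimator, and the arm with the highest optimistic index is played. The exploration bonus is a simplified stand-in (an effective-sample-size proxy) for the paper's Rényi-divergence-based confidence bound, and all names and constants here (toy_return, BONUS_SCALE, the Gaussian hyperpolicy) are assumptions made for illustration only.

import numpy as np

rng = np.random.default_rng(0)

ARMS = np.linspace(-2.0, 2.0, 9)   # candidate hyperpolicy means (the bandit "arms")
SIGMA = 0.5                        # fixed std of the Gaussian hyperpolicy
BONUS_SCALE = 0.5                  # exploration coefficient (illustrative choice)
T = 200                            # number of rounds

def gaussian_pdf(x, mu, sigma=SIGMA):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def toy_return(theta):
    """Noisy return of the policy with parameter theta (stand-in for a rollout)."""
    return float(np.exp(-(theta - 1.0) ** 2) + 0.1 * rng.standard_normal())

thetas, returns = [], []
counts = np.zeros(len(ARMS))

for t in range(1, T + 1):
    if t <= len(ARMS):
        k = t - 1                              # play each arm once to initialise
    else:
        th, rw = np.array(thetas), np.array(returns)
        # Denominator of the balance heuristic: mixture of all behavioural
        # hyperpolicies, weighted by how many samples each one generated.
        mix = sum(counts[j] * gaussian_pdf(th, ARMS[j]) for j in range(len(ARMS)))
        index = np.empty(len(ARMS))
        for j, mu in enumerate(ARMS):
            w = gaussian_pdf(th, mu) / mix     # MIS (balance heuristic) weights
            j_hat = float(np.sum(w * rw))      # off-policy estimate of the expected return
            # Simplified uncertainty term: an effective-sample-size proxy in place
            # of the paper's Renyi-divergence-based confidence bound.
            d2 = len(th) * float(np.sum(w ** 2))
            bonus = BONUS_SCALE * np.sqrt(d2 * np.log(t) / len(th))
            index[j] = j_hat + bonus
        k = int(np.argmax(index))              # optimism in the face of uncertainty

    theta = rng.normal(ARMS[k], SIGMA)         # draw a policy parameter from the chosen arm
    thetas.append(theta)
    returns.append(toy_return(theta))          # collect the (noisy) return
    counts[k] += 1

print("most played arm:", ARMS[int(np.argmax(counts))])

In this sketch the MIS estimate reuses every past sample regardless of which arm generated it, which is what lets the optimistic index shrink its uncertainty for arms that were never pulled directly; this is the structural advantage over treating the arms as independent.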
