Smoothed Dual Embedding Control

We revisit the Bellman optimality equation with Nesterov's smoothing technique and, via Fenchel duality, provide a new saddle-point optimization perspective on the policy optimization problem in reinforcement learning. We derive a new reinforcement learning algorithm, called Smoothed Dual Embedding Control (SDEC), that solves the saddle-point reformulation with arbitrary learnable function approximators. The algorithm bypasses the policy evaluation step of policy optimization in a principled way and extends naturally to multi-step bootstrapping and eligibility traces. We provide a PAC-learning bound on the number of samples needed from a single off-policy sample path, and we also characterize the convergence of the algorithm. Finally, we show that the algorithm compares favorably to state-of-the-art baselines on several benchmark control problems.
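To make the reformulation concrete, the following is a schematic sketch of the kind of derivation the abstract describes; it assumes entropy regularization as the smoothing and introduces the smoothing parameter \lambda, the entropy H, and the dual function \nu as illustrative notation, so it should not be read as the paper's exact objective. Smoothing the max in the Bellman optimality equation with an entropy term gives

    V(s) = \max_{\pi(\cdot|s)} \; \mathbb{E}_{a \sim \pi(\cdot|s)}\big[ R(s,a) + \gamma\, \mathbb{E}_{s'|s,a}[V(s')] \big] + \lambda\, H\big(\pi(\cdot|s)\big),

with the associated temporal-consistency residual \delta(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s'|s,a}[V(s')] - \lambda \log \pi(a|s) - V(s). Minimizing the squared residual directly runs into the double-sampling problem because of the conditional expectation inside the square, but the Fenchel dual of the square, x^2 = \max_{\nu} (2\nu x - \nu^2), turns the objective into a saddle-point problem

    \min_{V,\pi} \; \max_{\nu} \; \mathbb{E}_{s,a,s'}\Big[ 2\,\nu(s,a)\big( R(s,a) + \gamma V(s') - \lambda \log \pi(a|s) - V(s) \big) - \nu(s,a)^2 \Big],

in which V, \pi, and \nu can all be parameterized by arbitrary learnable function approximators and updated with stochastic primal-dual gradients computed from off-policy samples.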
