Smoothed Dual Embedding Control

We revisit the Bellman optimality equation with Nesterov's smoothing technique and, via Fenchel duality, provide a new saddle-point optimization perspective on the policy optimization problem in reinforcement learning. We derive a new reinforcement learning algorithm, called Smoothed Dual Embedding Control (SDEC), that solves the saddle-point reformulation with arbitrary learnable function approximators. The algorithm bypasses the policy evaluation step of policy optimization in a principled way and extends naturally to multi-step bootstrapping and eligibility traces. We provide a PAC-learning bound on the number of samples needed from a single off-policy sample path, and we also characterize the convergence of the algorithm. Finally, we show that the algorithm compares favorably to state-of-the-art baselines on several benchmark control problems.
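To make the reformulation concrete, the following is a schematic sketch of the kind of derivation the abstract describes; it assumes entropy regularization as the smoothing and introduces the smoothing parameter \lambda, the entropy H, and the dual function \nu as illustrative notation, so it should not be read as the paper's exact objective. Smoothing the max in the Bellman optimality equation with an entropy term gives

    V(s) = \max_{\pi(\cdot|s)} \; \mathbb{E}_{a \sim \pi(\cdot|s)}\big[ R(s,a) + \gamma\, \mathbb{E}_{s'|s,a}[V(s')] \big] + \lambda\, H\big(\pi(\cdot|s)\big),

with the associated temporal-consistency residual \delta(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s'|s,a}[V(s')] - \lambda \log \pi(a|s) - V(s). Minimizing the squared residual directly runs into the double-sampling problem because of the conditional expectation inside the square, but the Fenchel dual of the square, x^2 = \max_{\nu} (2\nu x - \nu^2), turns the objective into a saddle-point problem

    \min_{V,\pi} \; \max_{\nu} \; \mathbb{E}_{s,a,s'}\Big[ 2\,\nu(s,a)\big( R(s,a) + \gamma V(s') - \lambda \log \pi(a|s) - V(s) \big) - \nu(s,a)^2 \Big],

in which V, \pi, and \nu can all be parameterized by arbitrary learnable function approximators and updated with stochastic primal-dual gradients computed from off-policy samples.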
