A Study of Reinforcement Learning in the Continuous Case by the Means of Viscosity Solutions

This paper proposes a study of Reinforcement Learning (RL) for continuous state-space and time control problems, based on the theoretical framework of viscosity solutions (VSs). We use the method of dynamic programming (DP), which introduces the value function (VF), the expectation of the best future cumulative reinforcement. In the continuous case, the value function satisfies a non-linear differential equation, of first order for deterministic processes and of second order for stochastic ones, called the Hamilton-Jacobi-Bellman (HJB) equation. It is well known that this equation admits infinitely many generalized solutions (differentiable almost everywhere) other than the VF. We show that gradient-descent methods may converge to one of these generalized solutions, thus failing to find the optimal control. In order to solve the HJB equation, we use the powerful framework of viscosity solutions and state that there exists a unique viscosity solution to the HJB equation, which is the value function. Then, we use another main result of VSs (their stability when passing to the limit) to prove the convergence of numerical approximation schemes based on finite difference (FD) and finite element (FE) methods. These methods discretize, at some resolution, the HJB equation into the DP equation of a Markov Decision Process (MDP), which can be solved by DP methods (thanks to a “strong” contraction property), provided that all the initial data (the state dynamics and the reinforcement function) are perfectly known. However, in the RL approach, since we consider a system that learns “from experience” through interaction with an a priori (at least partially) unknown environment, the initial data are not perfectly known but have to be approximated during learning. The main contribution of this work is to derive a general convergence theorem for RL algorithms that use only “approximations” (in the sense of satisfying some “weak” contraction property) of the initial data. This result applies to model-based or model-free RL algorithms, with off-line or on-line updating methods, for deterministic or stochastic state dynamics (although the latter case is not described here), and based on FE or FD discretization methods. It is illustrated with several RL algorithms and one numerical simulation for the “Car on the Hill” problem.
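For concreteness, here is a minimal sketch of the objects the abstract refers to, written in a standard infinite-horizon discounted, deterministic formulation; the notation (state dynamics f, reinforcement r, control set U, discount factor γ) is generic, and the paper's exact setting (e.g. boundary or exit-time conditions) may differ in detail.

```latex
% Value function: best discounted cumulative reinforcement from state x,
% over trajectories following dx/dt = f(x, u) (illustrative formulation).
V(x) \;=\; \sup_{u(\cdot)} \int_0^{\infty} \gamma^{t}\, r\big(x(t), u(t)\big)\, dt ,
\qquad \dot{x}(t) = f\big(x(t), u(t)\big), \quad x(0) = x .

% First-order Hamilton-Jacobi-Bellman equation satisfied by V
% (in the viscosity sense), with discount rate ln(1/gamma) > 0:
V(x)\,\ln\tfrac{1}{\gamma} \;=\; \sup_{u \in U} \Big[\, r(x, u) \;+\; \nabla V(x) \cdot f(x, u) \,\Big] .
```

The discretization step described above replaces this equation by the DP equation of a finite MDP, which a “strong” contraction property makes solvable by value iteration. The following sketch illustrates that mechanism on a toy finite MDP; the sizes, random dynamics and names used here are purely hypothetical and not taken from the paper.

```python
# Illustrative sketch only: value iteration on a small finite MDP of the kind
# obtained by discretizing an HJB equation. The toy dynamics below are random
# placeholders, not the "Car on the Hill" problem or any model from the paper.
import numpy as np

n_states, n_actions = 50, 2
gamma = 0.95  # discount factor; gamma < 1 yields the "strong" contraction

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, :] sums to 1
R = rng.standard_normal((n_actions, n_states))                    # reinforcement R[a, s]

def bellman_backup(V):
    """DP operator T: (T V)(s) = max_a [ R(a, s) + gamma * sum_s' P(a, s, s') V(s') ]."""
    return np.max(R + gamma * (P @ V), axis=0)

# T is a gamma-contraction in sup norm, so the iterates V_{k+1} = T V_k
# converge to its unique fixed point, the value function of the finite MDP.
V = np.zeros(n_states)
for k in range(1000):
    V_next = bellman_backup(V)
    if np.max(np.abs(V_next - V)) < 1e-8:
        break
    V = V_next
print(f"converged after {k} iterations, sup-norm of V: {np.max(np.abs(V)):.3f}")
```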
