A Non-Parametric Approach to Dynamic Programming

In this paper, we consider the problem of policy evaluation for continuous-state systems. We present a non-parametric approach to policy evaluation that uses kernel density estimation to model the system dynamics. Under this model, the true form of the value function can be determined and computed using Galerkin's method. We also present a unified view of several well-known policy evaluation methods: the same Galerkin approach can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach outperformed the other methods.
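As a concrete illustration of the model-based idea, the sketch below performs kernel-based policy evaluation on a batch of sampled transitions: a Nadaraya-Watson smoother over the samples stands in for the estimated transition model, and the induced finite Bellman equation is solved exactly. This is a minimal sketch in the spirit of the approach; the function names, Gaussian kernel, and bandwidth are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def gaussian_kernel(x, xp, bandwidth):
    # Unnormalized Gaussian similarity between two state vectors.
    d = x - xp
    return np.exp(-0.5 * np.dot(d, d) / bandwidth**2)

def kernel_policy_evaluation(states, next_states, rewards, gamma=0.95, bandwidth=0.2):
    """Evaluate a fixed policy from sampled transitions (s_i, r_i, s'_i).

    Smooths the value of each successor state over the sampled states
    (Nadaraya-Watson weights), which turns the Bellman equation into a
    finite linear system v = r + gamma * P v solved at the sample points.
    """
    n = len(states)
    # W[i, j]: similarity of successor s'_i to sample state s_j.
    W = np.array([[gaussian_kernel(next_states[i], states[j], bandwidth)
                   for j in range(n)] for i in range(n)])
    P = W / W.sum(axis=1, keepdims=True)  # row-normalize -> stochastic matrix
    v = np.linalg.solve(np.eye(n) - gamma * P, rewards)
    return v

# Usage: a noisy 1-D system contracting toward the origin, quadratic cost.
rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, size=(200, 1))
s_next = 0.9 * s + 0.05 * rng.normal(size=s.shape)
r = -np.sum(s**2, axis=1)
v = kernel_policy_evaluation(s, s_next, r)
```

Because every Gaussian weight is strictly positive, each row of P is well defined, and I - gamma * P is invertible for gamma < 1, so the linear solve in this sketch always succeeds.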
