Approximate Value Iteration Based on Numerical Quadrature

Learning control policies has become an appealing alternative to deriving control laws with classical control theory. Value iteration approaches have proven remarkably flexible while maintaining high data efficiency when combined with probabilistic models to eliminate model bias. A major difficulty for these methods, however, is that the state and action spaces typically must be discretized, and the value function update is often analytically intractable. In this letter, we propose a projection-based approximate value iteration approach that employs numerical quadrature for the value function update step. It handles continuous state and action spaces as well as noisy measurements of the system dynamics while learning globally optimal control from scratch. In addition, the proposed approximation technique allows for upper bounds on the approximation error, which can be used to guarantee convergence to an optimal policy under some assumptions. Empirical evaluations on the mountain car benchmark problem demonstrate the efficiency of the proposed approach and support our theoretical results.
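
To make the quadrature-based value function update concrete, the sketch below shows one way such a backup and projection step could look on a toy one-dimensional problem. It is a minimal illustration, not the letter's actual implementation: it assumes Gauss-Hermite quadrature for the expectation over additive Gaussian transition noise and a least-squares projection onto Gaussian radial basis functions, and the dynamics, reward, and all parameters are invented for the example.

```python
import numpy as np

# Toy 1-D setup (all of this is an illustrative assumption):
# dynamics f(s, a) with additive Gaussian noise, quadratic reward,
# value function represented by Gaussian radial basis functions.

def f(s, a):
    return 0.9 * s + a                      # toy deterministic part of the dynamics

def reward(s, a):
    return -(s ** 2 + 0.1 * a ** 2)         # toy quadratic cost as negative reward

gamma = 0.95                                 # discount factor
noise_std = 0.1                              # std of the Gaussian transition noise
centers = np.linspace(-2.0, 2.0, 25)         # RBF centers covering the state space
width = 0.2                                  # RBF width
actions = np.linspace(-1.0, 1.0, 11)         # candidate actions for the maximization
states = np.linspace(-2.0, 2.0, 101)         # grid used for the projection step

def features(s):
    # Gaussian radial basis functions evaluated at the given states
    s = np.atleast_1d(s)
    return np.exp(-((s[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))

# Gauss-Hermite (probabilists') nodes and weights for E[V(f(s,a) + eps)],
# with eps ~ N(0, noise_std^2): E[g(eps)] ≈ sum_i w_i g(noise_std * x_i) / sqrt(2*pi)
gh_nodes, gh_weights = np.polynomial.hermite_e.hermegauss(10)

Phi = features(states)                       # feature matrix for the projection
w = np.zeros(len(centers))                   # value function weights

for _ in range(200):                         # approximate value iteration sweeps
    V_targets = np.empty_like(states)
    for i, s in enumerate(states):
        q_values = []
        for a in actions:
            # quadrature approximation of the expected next-state value
            next_states = f(s, a) + noise_std * gh_nodes
            expected_V = np.dot(gh_weights, features(next_states) @ w) / np.sqrt(2.0 * np.pi)
            q_values.append(reward(s, a) + gamma * expected_V)
        V_targets[i] = max(q_values)         # Bellman backup at this grid point
    # projection step: least-squares fit of the RBF model to the backed-up values
    w_new, *_ = np.linalg.lstsq(Phi, V_targets, rcond=None)
    if np.max(np.abs(w_new - w)) < 1e-6:     # stop once the weights have converged
        w = w_new
        break
    w = w_new
```

In this sketch the quadrature error and the projection error are the two sources of approximation; bounding them (as the letter does for its quadrature rule) is what enables convergence guarantees under suitable assumptions.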
