Variable Resolution Discretization in Optimal Control

The problem of state abstraction is of central importance in optimal control, reinforcement learning, and Markov decision processes. This paper studies variable resolution state abstraction for continuous-time, continuous-space, deterministic control problems in which near-optimal policies are required. We begin by defining a class of variable resolution policy and value function representations based on Kuhn triangulations embedded in a kd-trie. We then consider top-down approaches to choosing which cells to split in order to generate improved policies. The core of this paper is the introduction and evaluation of a wide variety of possible splitting criteria. We begin with local approaches, based on value function and policy properties, that use only features of individual cells in making split choices. We then introduce two new non-local measures, influence and variance, and derive splitting criteria that allow one cell to efficiently take into account its impact on other cells when deciding whether to split. Influence is an efficiently computable measure of the extent to which changes in some state affect the value function of other states. Variance is an efficiently computable measure of how risky some state in a Markov chain is: a low-variance state is one for which, during any single execution, the long-term reward attained from that state would be very unlikely to differ substantially from its expected value, as given by the value function.

The paper proceeds by graphically demonstrating the various splitting approaches on the familiar two-dimensional, non-linear, non-minimum-phase “Car on the Hill” problem. It then evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.
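To make the representation concrete, the following is a minimal Python sketch (not code from the paper) of a kd-trie whose leaves are hyper-rectangular cells split at their midpoints, together with Kuhn-triangulation interpolation, which evaluates a point from the d+1 corners of one simplex of the cell rather than all 2^d corners. The Cell class, kuhn_value, and the corner-value function f_corner are hypothetical names introduced purely for illustration.

import numpy as np

class Cell:
    # One node of a kd-trie: a hyper-rectangle [lo, hi] that is either a
    # leaf or is split at its midpoint along one axis into two children.
    def __init__(self, lo, hi):
        self.lo = np.asarray(lo, dtype=float)
        self.hi = np.asarray(hi, dtype=float)
        self.axis = None
        self.children = None

    def split(self, axis):
        # Trie-style split: always at the midpoint of the chosen axis.
        mid = 0.5 * (self.lo[axis] + self.hi[axis])
        hi0, lo1 = self.hi.copy(), self.lo.copy()
        hi0[axis], lo1[axis] = mid, mid
        self.axis = axis
        self.children = (Cell(self.lo, hi0), Cell(lo1, self.hi))

    def leaf(self, x):
        # Descend the trie to the leaf cell containing x.
        if self.children is None:
            return self
        mid = 0.5 * (self.lo[self.axis] + self.hi[self.axis])
        return self.children[0 if x[self.axis] < mid else 1].leaf(x)

def kuhn_value(root, x, f_corner):
    # Kuhn-triangulation (barycentric) interpolation inside the leaf
    # containing x: uses only the d+1 corners of one simplex of the cell.
    c = root.leaf(x)
    y = (np.asarray(x, dtype=float) - c.lo) / (c.hi - c.lo)  # local coords
    order = np.argsort(-y)             # sort coordinates in decreasing order
    v = np.zeros(len(y), dtype=int)    # current simplex corner (binary)
    val = (1.0 - y[order[0]]) * f_corner(c, tuple(v))
    for k, dim in enumerate(order):
        v[dim] = 1                     # walk to the next corner of the simplex
        w = y[dim] - (y[order[k + 1]] if k + 1 < len(order) else 0.0)
        val += w * f_corner(c, tuple(v))
    return val

# Illustrative corner-value table: the "value" at a corner is just the sum
# of its coordinates, so interpolating this linear function is exact.
def f_corner(cell, bits):
    corner = cell.lo + np.asarray(bits) * (cell.hi - cell.lo)
    return corner.sum()

root = Cell([0.0, 0.0], [1.0, 1.0])
root.split(axis=0)                     # refine the root along dimension 0
print(kuhn_value(root, [0.3, 0.7], f_corner))   # ~1.0 (= 0.3 + 0.7)

Because each interpolation touches only d+1 corners instead of 2^d, value backups stay cheap as the dimension grows, which is what makes the representation practical under recursive splitting.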

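The two non-local measures also admit compact fixed-point computations on the finite Markov chain induced by the current policy. The sketch below is one plausible formalization, not necessarily the paper's exact definitions, using a made-up three-state chain: influence is taken as Inf[i, j] = dV(i)/dr(j), the discounted expected number of visits to j starting from i, the fixed point of Inf = Id + γ·P·Inf; the variance of the discounted return uses the standard second-moment recursion, assuming rewards are deterministic functions of the state.

import numpy as np

# An illustrative three-state Markov chain (hypothetical numbers, not taken
# from the paper): P[i, j] is the probability of moving from state i to
# state j under the current policy; state 2 is absorbing.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
r = np.array([0.0, 1.0, 0.0])      # deterministic per-state reward
gamma = 0.9

# Value function: the fixed point of V = r + gamma * P V, solved directly.
V = np.linalg.solve(np.eye(3) - gamma * P, r)

# Influence: Inf[i, j] = dV(i)/dr(j), the discounted expected number of
# visits to j starting from i; the fixed point of Inf = Id + gamma * P Inf.
Inf = np.eye(3)
for _ in range(500):
    Inf = np.eye(3) + gamma * P @ Inf

# Variance of the discounted return from each state.  With deterministic
# rewards all randomness comes from the transitions, giving the recursion
#   sigma2(i) = gamma^2 * ( sum_j P[i,j]*sigma2(j)
#                           + sum_j P[i,j]*V(j)^2 - (sum_j P[i,j]*V(j))^2 )
EV, EV2 = P @ V, P @ (V ** 2)
sigma2 = np.zeros(3)
for _ in range(500):
    sigma2 = gamma ** 2 * (P @ sigma2 + EV2 - EV ** 2)

print("V      =", V)
print("Inf[0] =", Inf[0])   # influence of each state on state 0's value
print("sigma2 =", sigma2)   # a low-variance state has a predictable return

A splitting criterion built on these quantities can, for example, weight a cell's local value-function error by the influence its states exert on the rest of the state space, so that refinement effort is spent where it actually changes the answer elsewhere.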