Approximate Dynamic Programming

In any complex or large-scale sequential decision-making problem, there is a crucial need for function approximation to represent the relevant functions, such as the value function or the policy. The Dynamic Programming (DP) and Reinforcement Learning (RL) methods introduced in previous chapters make the implicit assumption that the value function can be represented perfectly (i.e., kept in memory), for example by using a look-up table (with a finite number of entries) assigning a value to every possible state (assumed to be finite) of the system. Those methods are called exact because they provide an exact computation of the optimal solution of the considered problem (or, at least, enable the computations to converge to this optimal solution). However, such methods often apply to toy problems only, since in most interesting applications the number of possible states is so large (and possibly infinite, if we consider continuous spaces) that a perfect representation of the function at all states is impossible. It then becomes necessary to approximate the function using a moderate number of coefficients (which can be stored in a computer), and thereby extend the range of DP and RL to methods that work with such approximate representations. These approximate methods combine DP and RL techniques with function-approximation tools.
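To make the contrast concrete, the following minimal Python sketch (illustrative only; the polynomial feature map, the scalar state space, and the synthetic regression targets are assumptions introduced here, not part of the text) compares an exact tabular value function, which stores one entry per state, with an approximate one parameterized by a small number of coefficients and fitted by ordinary least squares.

```python
import numpy as np

# --- Exact representation: one entry per state (feasible only for small state spaces) ---
num_states = 10
V_table = np.zeros(num_states)            # look-up table: |S| values stored explicitly

# --- Approximate representation: V(s) ~= phi(s)^T theta, with d << |S| coefficients ---
def phi(s, d=4):
    """Hypothetical feature map: d polynomial features of a scalar state s."""
    return np.array([s**k for k in range(d)], dtype=float)

theta = np.zeros(4)                        # only d coefficients are stored

def V_approx(s):
    """Approximate value of state s under the current parameter vector theta."""
    return phi(s) @ theta

# Illustration: fit theta by least squares to match target values at sampled states.
# The targets stand in for whatever backup values a DP/RL method would produce.
states = np.linspace(0.0, 1.0, 20)         # sampled states from a (possibly continuous) space
targets = np.sin(np.pi * states)           # synthetic stand-in for Bellman backup targets
Phi = np.stack([phi(s) for s in states])   # design matrix, shape (20, 4)
theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

print(V_approx(0.5))                       # the approximation can be evaluated at any state
```

The point of the sketch is only that the approximate representation stores d coefficients rather than one value per state and can be evaluated at states never visited during fitting; any DP or RL method built on top of it then has to contend with the resulting approximation error.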
