Basis Function Adaptation in Temporal Difference Reinforcement Learning

Reinforcement Learning (RL) is an approach for solving complex multi-stage decision problems that fall under the general framework of Markov Decision Problems (MDPs), with possibly unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact representation and enables generalization. Linear approximation architectures (where the adjustable parameters are the weights of pre-fixed basis functions) have recently gained prominence due to efficient algorithms and convergence guarantees. Nonetheless, an appropriate choice of basis functions is important for the success of the algorithm. In the present paper we examine methods for adapting the basis functions during the learning process, in the context of evaluating the value function under a fixed control policy. Using the Bellman approximation error as an optimization criterion, we optimize the weights of the basis functions while simultaneously adapting their (non-linear) parameters. We present two algorithms for this problem: the first uses a gradient-based approach, and the second applies the Cross Entropy method. The performance of the proposed algorithms is evaluated and compared in simulations.
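
The following is a minimal sketch, not the authors' algorithm, of the general idea the abstract describes: evaluate a fixed policy with a linear approximation over Gaussian radial basis functions, updating the linear weights with a semi-gradient TD rule while simultaneously adapting the non-linear basis parameters (here, the RBF centers) by gradient descent on the one-step Bellman error. The toy chain MDP, the step sizes, and all variable names are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's exact algorithm) of adapting
# basis function parameters alongside the linear weights when evaluating a
# fixed policy, using the one-step Bellman (TD) error as the criterion.
# The toy chain MDP, step sizes, and variable names are assumptions.

import numpy as np

# Toy deterministic chain under a fixed policy: states 0..N-1, the policy
# always moves right, and a reward of 1 is received on reaching state N-1.
N, gamma = 20, 0.95

def step(s):
    s_next = min(s + 1, N - 1)
    reward = 1.0 if s_next == N - 1 else 0.0
    return s_next, reward

# Gaussian radial basis functions phi_k(s) = exp(-0.5 ((s - c_k) / width)^2);
# the centers c_k are the non-linear parameters being adapted.
def features(s, centers, width):
    return np.exp(-0.5 * ((s - centers) / width) ** 2)

K = 5
centers = np.linspace(0, N - 1, K)   # non-linear basis parameters
width = 3.0
w = np.zeros(K)                      # linear weights
alpha_w, alpha_c = 0.05, 0.01        # step sizes (assumed)

for episode in range(200):
    s = 0
    while s < N - 1:
        s_next, r = step(s)
        phi = features(s, centers, width)
        phi_next = features(s_next, centers, width)

        # One-step Bellman (TD) error of the current approximation
        delta = r + gamma * w @ phi_next - w @ phi

        # Semi-gradient step on the squared error w.r.t. the linear weights
        w += alpha_w * delta * phi

        # Semi-gradient step w.r.t. the RBF centers:
        # d phi_k / d c_k = phi_k * (s - c_k) / width^2
        dphi_dc = phi * (s - centers) / width ** 2
        centers += alpha_c * delta * (w * dphi_dc)

        s = s_next

print("adapted centers:", np.round(centers, 2))
print("estimated V(0):", round(w @ features(0, centers, width), 3),
      " true V(0):", round(gamma ** (N - 2), 3))
```

A Cross Entropy variant of the same idea would instead maintain a parameterized distribution over the basis parameters, sample candidate parameter vectors, score each candidate by its Bellman approximation error (fitting the linear weights for each candidate, e.g., by least squares), and refit the distribution to the best-scoring samples.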
