Paper information - Improving Policies without Measuring Merits

Improving Policies without Measuring Merits

Performing policy iteration in dynamic programming should only require knowledge of relative rather than absolute measures of the utility of actions (Werbos, 1991), which Baird (1993) calls the advantages of actions at states. Nevertheless, most existing methods in dynamic programming (including Baird's) compute some form of absolute utility function. For smooth problems, advantages satisfy two differential consistency conditions (including the requirement that they be free of curl), and we show that enforcing these can lead to appropriate policy improvement solely in terms of advantages.
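The claim that policy improvement needs only relative measures can be illustrated with a small sketch. This is not the paper's method, which concerns smooth problems and differential consistency conditions. It is a discrete, tabular analogue that assumes the common definition A(s, a) = Q(s, a) - V(s) for the current policy. The three-state MDP, its rewards, and all function names below are hypothetical and chosen only for illustration.

```python
# Hypothetical 3-state, 2-action deterministic MDP (illustrative numbers only).
GAMMA = 0.9
NEXT = {0: [1, 2], 1: [0, 2], 2: [2, 2]}            # NEXT[s][a] -> next state
REWARD = {0: [0.0, 1.0], 1: [2.0, 0.0], 2: [0.0, 0.0]}
STATES = range(len(NEXT))
ACTIONS = range(2)


def evaluate(policy, sweeps=500):
    """Iterative policy evaluation: V(s) = r(s, pi(s)) + gamma * V(next state)."""
    v = [0.0 for _ in STATES]
    for _ in range(sweeps):
        v = [REWARD[s][policy[s]] + GAMMA * v[NEXT[s][policy[s]]] for s in STATES]
    return v


def advantages(policy):
    """Assumed definition: A(s, a) = Q(s, a) - V(s) under the given policy."""
    v = evaluate(policy)
    return [[REWARD[s][a] + GAMMA * v[NEXT[s][a]] - v[s] for a in ACTIONS]
            for s in STATES]


def improve(policy):
    """Greedy improvement that reads only the relative quantities A(s, a)."""
    adv = advantages(policy)
    return [max(ACTIONS, key=lambda a: adv[s][a]) for s in STATES]


def policy_iteration(policy):
    """Alternate evaluation and improvement until the policy stops changing."""
    while True:
        new_policy = improve(policy)
        if new_policy == policy:
            return policy
        policy = new_policy


if __name__ == "__main__":
    print(policy_iteration([1, 1, 0]))
```

The improvement step never compares values across states. It only ranks actions within a state, so any per-state offset added to the utilities would leave the resulting policy unchanged. Note that this sketch still computes V internally in order to obtain the advantages; the abstract's point is that, for smooth problems, the consistency conditions can remove even that dependence.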

Peter Dayan | Satinder P. Singh
