Policy Search using Paired Comparisons

Direct policy search is a practical way to solve reinforcement learning (RL) problems involving continuous state and action spaces. The goal becomes one of finding policy parameters that maximize a noisy objective function. The Pegasus method converts this stochastic optimization problem into a deterministic one by using fixed start states and fixed random-number sequences when comparing policies (Ng and Jordan, 2000). We evaluate Pegasus and several new paired-comparison methods on the mountain-car problem and on a difficult pursuer-evader problem. We conclude that: (i) paired tests can improve the performance of optimization procedures; (ii) several methods are available to reduce the 'overfitting' effect observed with Pegasus; (iii) adapting the number of trials used for each comparison yields faster learning; and (iv) pairing also helps stochastic search methods such as differential evolution.
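To make the pairing idea concrete, here is a minimal Python sketch of a Pegasus-style paired comparison. It is not the authors' implementation: the toy dynamics, linear policy, reward, and all constants are illustrative assumptions. Both policies are rolled out from the same fixed start states with the same fixed random-number seeds, so each scenario's return is a deterministic function of the policy parameters and the comparison between two policies becomes a paired one.

```python
import numpy as np

N_SCENARIOS = 20   # fixed (start state, seed) pairs shared by every policy
HORIZON = 100      # steps per episode

def rollout(policy_params, start_state, seed, horizon=HORIZON):
    """Total reward of one episode under a fixed random-number sequence."""
    rng = np.random.default_rng(seed)  # fixed seed: identical noise for any policy
    state, total = start_state, 0.0
    for _ in range(horizon):
        action = np.tanh(policy_params @ state)           # toy linear policy
        noise = rng.normal(scale=0.05, size=state.shape)  # same draws every call
        state = 0.9 * state + 0.1 * action + noise        # toy stochastic dynamics
        total += -float(state @ state)                    # toy reward: stay near origin
    return total

def paired_comparison(params_a, params_b, scenarios):
    """Per-scenario return differences; pairing cancels scenario-level noise."""
    return np.array([rollout(params_a, s, seed) - rollout(params_b, s, seed)
                     for s, seed in scenarios])

rng = np.random.default_rng(0)
scenarios = [(rng.normal(size=4), int(rng.integers(1 << 30)))
             for _ in range(N_SCENARIOS)]
theta_a, theta_b = rng.normal(size=4), rng.normal(size=4)
d = paired_comparison(theta_a, theta_b, scenarios)
print(f"mean difference {d.mean():+.4f}, "
      f"paired std err {d.std(ddof=1) / np.sqrt(len(d)):.4f}")
```

Because both rollouts in each scenario share the start state and noise draws, the variance of each per-scenario difference is far smaller than that of an unpaired difference, which is what lets a paired test (or Pegasus's fully deterministic objective) decide a comparison with fewer trials.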

[1] John A. Nelder et al. A Simplex Method for Function Minimization, 1965, Comput. J.

[2] J. S. Hunter et al. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building, 1979.

[3] Andrew W. Moore et al. Efficient Algorithms for Minimizing Cross Validation Error, 1994, ICML.

[4] Ben J. A. Kröse et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.

[5] Mance E. Harmon et al. Multi-Agent Residual Advantage Learning with General Function Approximation, 1996.

[6] Margaret H. Wright. Direct search methods: Once scorned, now respectable, 1996.

[7] Astro Teller et al. Automatically Choosing the Number of Fitness Cases: The Rational Allocation of Trials, 1997.

[8] Rainer Storn et al. Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces, 1997, J. Glob. Optim.

[9] R. Storn et al. Differential Evolution: A Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces, 1997.

[10] Andrew G. Barto et al. Reinforcement Learning, 1998.

[11] M. J. D. Powell. Direct search algorithms for optimization calculations, 1998, Acta Numerica.

[12] Jieyu Zhao et al. Direct Policy Search and Uncertain Policy Evaluation, 1998.

[13] Andrew W. Moore et al. Variable Resolution Discretization for High-Accuracy Solutions of Optimal Control Problems, 1999, IJCAI.

[14] L. Baird. Reinforcement Learning Through Gradient Descent, 1999.

[15] C. T. Kelley. Detection and Remediation of Stagnation in the Nelder-Mead Algorithm Using a Sufficient Decrease Condition, 1999, SIAM J. Optim.

[16] Andrew Y. Ng et al. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping, 1999, ICML.

[17] Yishay Mansour et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[18] Michael I. Jordan et al. PEGASUS: A policy search method for large MDPs and POMDPs, 2000, UAI.

[19] J. Baxter et al. Direct gradient-based reinforcement learning, 2000, Proc. IEEE International Symposium on Circuits and Systems (ISCAS 2000).

[20] Andrew W. Moore et al. Learning Evaluation Functions to Improve Optimization by Local Search, 2001, J. Mach. Learn. Res.

[21] Andrew W. Moore et al. Direct Policy Search using Paired Statistical Tests, 2001, ICML.

[22] Isaac E. Lagaris et al. Training Reinforcement Neurocontrollers Using the Polytope Algorithm, 1998, Neural Processing Letters.