Protecting against evaluation overfitting in empirical reinforcement learning

Empirical evaluations play an important role in machine learning. However, the usefulness of any evaluation depends on the empirical methodology employed. Designing good empirical methodologies is difficult in part because agents can overfit test evaluations and thereby obtain misleadingly high scores. We argue that reinforcement learning is particularly vulnerable to environment overfitting and propose generalized methodologies as a remedy, in which evaluations are based on multiple environments sampled from a distribution. In addition, we consider how to summarize performance when scores from different environments may not have commensurate values. Finally, we present proof-of-concept results demonstrating how these methodologies can validate an intuitively useful range-adaptive tile-coding method.
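
To make the idea of a generalized methodology concrete, the toy sketch below evaluates each agent on many environments drawn from a distribution and summarizes the results with mean per-environment ranks, which stay comparable even when raw scores from different environments are not commensurate. This is an illustration only, not the paper's protocol: sample_environment, evaluate, and the one-dimensional "guess the goal" task are all hypothetical stand-ins.

# A toy "generalized" evaluation: draw many environments from a
# distribution, score every agent on each one, and summarize with
# per-environment ranks rather than raw scores.
import random
from statistics import mean

def sample_environment(rng):
    # Hypothetical environment distribution: each draw fixes a goal
    # position for a one-dimensional "guess the goal" task.
    return {"goal": rng.uniform(-1.0, 1.0)}

def evaluate(agent, env, rng):
    # Hypothetical scoring stub: negative distance from the goal,
    # plus a little evaluation noise.
    return -abs(agent(env) - env["goal"]) + rng.gauss(0.0, 0.01)

def generalized_evaluation(agents, n_envs=30, seed=0):
    rng = random.Random(seed)
    envs = [sample_environment(rng) for _ in range(n_envs)]
    # scores[i][j] is agent i's score on environment j.
    scores = [[evaluate(agent, env, rng) for env in envs] for agent in agents]
    mean_ranks = []
    for i in range(len(agents)):
        ranks = []
        for j in range(n_envs):
            # Rank within each environment (1 = best); ranks are
            # comparable across environments even when raw scores are not.
            better = sum(scores[k][j] > scores[i][j] for k in range(len(agents)))
            ranks.append(better + 1)
        mean_ranks.append(mean(ranks))
    return mean_ranks

# An agent tuned to one fixed environment vs. one that adapts to each draw.
overfit_agent = lambda env: 0.0
adaptive_agent = lambda env: env["goal"]
print(generalized_evaluation([overfit_agent, adaptive_agent]))

Rank-based summaries of this kind sidestep the commensurability problem because each environment contributes only an ordering of the agents, not a raw score; an agent tuned to a single fixed environment (overfit_agent above) ranks poorly once performance is aggregated across the sampled set.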

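The abstract does not specify how the range-adaptive tile-coding method works; purely as an illustration, the sketch below assumes "range-adaptive" means the tile coder rescales its tiles to the minimum and maximum values observed so far in each state dimension, so no a priori bounds on the state space are needed. The class name and all parameters are hypothetical.

# A sketch of a range-adaptive tile coder: tile widths are rescaled to
# the minimum/maximum values observed so far in each state dimension,
# so no a priori bounds on the state space are required.
import math

class RangeAdaptiveTileCoder:
    def __init__(self, n_dims, n_tilings=8, tiles_per_dim=8):
        self.n_tilings = n_tilings
        self.tiles_per_dim = tiles_per_dim
        self.lo = [math.inf] * n_dims   # smallest value seen per dimension
        self.hi = [-math.inf] * n_dims  # largest value seen per dimension

    def active_tiles(self, state):
        # Grow the observed range to cover this state.
        for d, x in enumerate(state):
            self.lo[d] = min(self.lo[d], x)
            self.hi[d] = max(self.hi[d], x)
        tiles = []
        for t in range(self.n_tilings):
            coords = [t]
            for d, x in enumerate(state):
                span = (self.hi[d] - self.lo[d]) or 1.0
                # Normalize to the observed range, then offset each
                # tiling by a fraction of one tile width.
                z = (x - self.lo[d]) / span * self.tiles_per_dim
                coords.append(int(z + t / self.n_tilings))
            tiles.append(tuple(coords))
        return tiles  # one hashable tile identity per tiling

coder = RangeAdaptiveTileCoder(n_dims=2)
print(coder.active_tiles([0.3, -1.2]))
print(coder.active_tiles([2.0, 0.5]))

One caveat the sketch glosses over: whenever the observed range grows, earlier tile identities no longer denote the same regions of state space, so any value-function weights attached to them become stale; a complete method would have to handle that, for instance by relearning or rescaling the stored weights.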