On Evaluating Agent Performance in a Fixed Period of Time (Extended Version)

The evaluation of several policies, individuals, systems or subjects (in what follows, agents) on a given task within a finite period of time is a very common problem in experimental design, statistics, computer science, economics and, in general, any experimental science. It is also crucial for measuring intelligence. When agents receive feedback that allows them to adjust their performance, we face a more specific (but still very broad) problem that is frequent in control, robotics and artificial intelligence. In reinforcement learning, the task is formalised as an interactive environment and feedback is represented by reward values, which allows these problems to be modelled properly. In this and related areas (such as Markov Decision Processes), several performance measures have been derived to evaluate the goodness of an agent in an environment. Typically, the decision the agent must make is a choice among a set of actions, cycle after cycle. However, in real evaluation scenarios, time can be intentionally modulated by the agent. Consequently, agents not only choose an action but also choose when to perform it. This is natural in biological systems, and it is also an issue in control (some decisions must be made quickly while others can take more time). In this paper, we revisit the classical reward-aggregating (payoff) functions commonly used in reinforcement learning and related areas, analyse the problems of each of them, and propose two new modifications of the average reward that yield a consistent measurement for continuous time, where the agent decides not only which action to perform but also how long the decision will take.
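To make the contrast concrete, the following is a minimal sketch of the classical payoff functions mentioned above (cumulative, discounted and average reward). The final time-weighted variant is a hypothetical illustration of measuring reward per unit of physical time when the agent also chooses how long each decision takes; it is not the paper's actual proposal, whose two modifications are developed in the body of the text.

```python
# Classical reward-aggregation (payoff) functions over a finite episode.
# The time-weighted variant below is a hypothetical illustration only.

def cumulative_reward(rewards):
    """Total (undiscounted) reward: r_1 + ... + r_n."""
    return sum(rewards)

def discounted_reward(rewards, gamma=0.9):
    """Discounted reward: sum of gamma^i * r_i for i = 0..n-1."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

def average_reward(rewards):
    """Average reward per interaction cycle (discrete time)."""
    return sum(rewards) / len(rewards)

def time_weighted_average(rewards, durations):
    """Hypothetical continuous-time variant: reward per unit of
    physical time, where durations[i] is the time the agent
    spent producing decision i."""
    return sum(rewards) / sum(durations)

# Two agents collect the same rewards, but the second acts twice as fast;
# only the time-weighted measure distinguishes them:
rewards = [0.0, 1.0, 1.0, 1.0]
slow = time_weighted_average(rewards, [2.0, 2.0, 2.0, 2.0])  # 3.0 / 8.0
fast = time_weighted_average(rewards, [1.0, 1.0, 1.0, 1.0])  # 3.0 / 4.0
```

Note how the per-cycle average reward is identical for both agents, which is exactly the kind of inconsistency under continuous time that motivates modifying the average reward.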
