Reinforcement Learning with a Corrupted Reward Channel

No real-world reward function is perfect. Sensory errors and software bugs may result in RL agents observing higher (or lower) rewards than they should. For example, a reinforcement learning agent may prefer states in which a sensory error yields the maximum observed reward, even though the true reward there is small. We formalise this problem as a generalised Markov Decision Process called a Corrupt Reward MDP (CRMDP). Traditional RL methods fare poorly in CRMDPs, even under strong simplifying assumptions and even when they try to compensate for the possibly corrupt rewards. Two ways around the problem are investigated. First, by giving the agent richer data, as in inverse reinforcement learning and semi-supervised reinforcement learning, reward corruption stemming from systematic sensory errors may sometimes be managed completely. Second, by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.

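To make the failure mode concrete, the toy one-step example below is a purely illustrative sketch and not the paper's construction: the state names, reward values, and the quantile parameter q are invented, and the quantiliser only follows the spirit of Taylor et al.'s quantilizers. It shows how an agent that maximises the observed (possibly corrupt) reward is drawn to the corrupted state, while mild randomisation over near-optimal actions recovers a higher true reward in expectation.

```python
# Illustrative sketch only: a one-step decision problem with a corrupted
# reward channel. State names, reward values, and q are invented.
import random

# True reward of each state the agent can choose to end up in.
true_reward = {"s_safe": 0.8, "s_ok": 0.6, "s_glitch": 0.1}

def observed_reward(state):
    # Corruption: a sensory error reports the maximum reward in s_glitch.
    return 1.0 if state == "s_glitch" else true_reward[state]

def maximiser(states):
    # Standard behaviour: commit to the state with the highest *observed*
    # reward, and hence get fooled by the corruption.
    return max(states, key=observed_reward)

def quantiliser(states, q=0.5, rng=random):
    # Randomisation that blunts optimisation: sample uniformly from the
    # top-q fraction of states ranked by observed reward, rather than
    # committing to the single observed-reward argmax.
    ranked = sorted(states, key=observed_reward, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return rng.choice(top)

states = list(true_reward)
print("maximiser true reward      :", true_reward[maximiser(states)])  # 0.1
rng = random.Random(0)
picks = [quantiliser(states, q=0.67, rng=rng) for _ in range(1000)]
print("quantiliser avg true reward:",
      sum(true_reward[s] for s in picks) / len(picks))                  # ~0.45
```

In this sketch the observed-reward maximiser deterministically earns true reward 0.1, whereas the quantiliser averages roughly 0.45 over the top two observed-reward states, mirroring the abstract's claim that randomisation only partially manages reward corruption.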