Planning Delayed-Response Queries and Transient Policies under Reward Uncertainty

We address settings in which an agent that is uncertain about its reward function can selectively query another agent (or a human) to improve its knowledge of the rewards, and thus its policy. When there is a time delay between posing the query and receiving the response, the agent must decide how to behave during the transient phase while it waits. To act optimally, therefore, the agent must jointly optimize its transient policy together with its query. In this paper, we formalize this joint optimization problem and present a new algorithm, JQTP, for optimizing the Joint Query and Transient Policy. We also provide a clustering technique that JQTP can use to flexibly trade performance for reduced computation. We illustrate our algorithms on a machine-configuration task.
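The joint optimization described above can be made concrete on a toy problem. The sketch below is illustrative only (the MDP, reward hypotheses, query set, and all numbers are our own assumptions, not the paper's): an agent on a three-state chain holds a prior over three reward hypotheses, may pose one binary query (a partition of the hypothesis set) whose answer arrives after a fixed delay, follows a transient policy scored under the prior-mean reward in the meantime, and then acts optimally for the posterior-mean reward once the answer arrives. Brute force over (query, transient policy) pairs finds the joint optimum, showing why the two choices cannot be optimized independently: the best transient policy depends on where the query's possible answers will send the agent next.

```python
import itertools

# Toy chain MDP: states 0..2, actions move left/right (clipped at the ends).
STATES = [0, 1, 2]
ACTIONS = [-1, +1]
GAMMA = 0.9
DELAY = 3          # steps until the query response arrives
START = 1

def step(s, a):
    return min(max(s + a, 0), len(STATES) - 1)

# Reward hypotheses (vectors over states) with a uniform prior -- illustrative.
HYPOTHESES = [(1.0, 0.0, 0.0),   # left end is good
              (0.0, 0.0, 1.0),   # right end is good
              (0.0, 0.5, 0.0)]   # middle is mildly good
PRIOR = [1/3, 1/3, 1/3]

def mean_reward(belief):
    # Expected reward vector under a belief over hypotheses.
    return tuple(sum(b * h[s] for b, h in zip(belief, HYPOTHESES))
                 for s in STATES)

def value(s, r, iters=200):
    # Value iteration for a known reward vector r (reward collected on arrival).
    v = [0.0] * len(STATES)
    for _ in range(iters):
        v = [max(r[step(x, a)] + GAMMA * v[step(x, a)] for a in ACTIONS)
             for x in STATES]
    return v[s]

def posterior(belief, cell):
    # Condition the belief on the response "true hypothesis is in `cell`".
    mass = sum(belief[i] for i in cell)
    return [belief[i] / mass if i in cell else 0.0 for i in range(len(belief))]

def evaluate(query, policy):
    # Transient phase: follow `policy` for DELAY steps, scored under the
    # prior-mean reward; then act optimally for the posterior-mean reward.
    s, total = START, 0.0
    rbar = mean_reward(PRIOR)
    for t in range(DELAY):
        s = step(s, policy[s])
        total += GAMMA**t * rbar[s]
    for cell in query:                      # each possible response
        p = sum(PRIOR[i] for i in cell)
        total += p * GAMMA**DELAY * value(s, mean_reward(posterior(PRIOR, cell)))
    return total

# Candidate queries = binary partitions of the hypothesis set; the last
# partition asks nothing. Jointly search queries and deterministic
# stationary transient policies.
QUERIES = [[(0,), (1, 2)], [(1,), (0, 2)], [(2,), (0, 1)], [(0, 1, 2)]]
best = max(((q, pi) for q in QUERIES
            for pi in itertools.product(ACTIONS, repeat=len(STATES))),
           key=lambda qp: evaluate(*qp))
print(best, evaluate(*best))
```

In this instance the trivial "ask nothing" query is dominated: resolving whether an extreme state is rewarding lets the post-response policy commit to an end of the chain, and the best transient policy drifts toward the end favored by the chosen query while collecting prior-mean reward along the way.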
