Simulation methods for uncertain decision-theoretic planning

Experience-based reinforcement learning (RL) systems are known to be useful for dealing with domains that are a priori unknown. We believe that experience-based methods may also be useful when the model is uncertain (or even completely known). In this case, experience is gained by simulating the uncertain model. This paper explores a simple way to allow experience-based RL systems to cope with uncertainty in a model. The form of RL we consider is a policy-gradient method, and the domains we optimise over come from temporal decision-theoretic planning. Our previous experience with military planning problems indicates that a human-specified model of the planning problem is often inaccurate, especially when humans specify probabilities; thus, planners that take this uncertainty into account are very useful. Despite our focus on policy-gradient RL for planning, our simple (but approximate) solution for dealing with uncertainty in the model can be applied to any simulation-based RL method, such as Q-learning or SARSA. Our attempt to solve decision-theoretic planning problems with a policy-gradient approach is novel in itself, and constitutes a further contribution of this paper.
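
To illustrate the general idea of gaining experience by simulating an uncertain model, the following minimal sketch (ours, not taken from the paper) assumes the model uncertainty is represented as a Dirichlet distribution over transition probabilities for each state-action pair; each simulated episode is then generated from a freshly sampled model and used for a REINFORCE-style policy-gradient update. All names and the choice of Dirichlet priors are illustrative assumptions.

    # Hypothetical sketch: simulation-based policy-gradient learning under an
    # uncertain model. A transition model is sampled from a Dirichlet prior
    # before each episode, so the gradient averages over model uncertainty.
    import numpy as np

    def sample_model(alpha):
        # alpha: dict mapping (state, action) -> Dirichlet parameters over next states
        return {sa: np.random.dirichlet(a) for sa, a in alpha.items()}

    def simulate_episode(model, policy_params, reward, start_state, horizon,
                         n_states, n_actions):
        # Roll out one episode from the sampled model under a softmax policy.
        states, actions, rewards = [], [], []
        s = start_state
        for _ in range(horizon):
            logits = policy_params[s]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            a = np.random.choice(n_actions, p=probs)
            states.append(s)
            actions.append(a)
            rewards.append(reward[s, a])
            s = np.random.choice(n_states, p=model[(s, a)])
        return states, actions, rewards

    def reinforce_update(policy_params, episode, lr=0.01):
        # REINFORCE: adjust softmax logits in proportion to the return-to-go.
        states, actions, rewards = episode
        returns = np.cumsum(rewards[::-1])[::-1]
        for s, a, g in zip(states, actions, returns):
            logits = policy_params[s]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            grad = -probs
            grad[a] += 1.0            # d log pi(a|s) / d logits
            policy_params[s] += lr * g * grad
        return policy_params

Because the learner only ever interacts with sampled simulations, the same sampling step could be wrapped around any simulation-based RL method (e.g. Q-learning or SARSA) without changing the learning rule itself.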