Value-Function Approximations for Partially Observable Markov Decision Processes

Partially observable Markov decision processes (POMDPs) provide an elegant mathematical framework for modeling complex decision and planning problems in stochastic domains in which the states of the system are observable only indirectly, via a set of imperfect or noisy observations. The modeling advantage of POMDPs, however, comes at a price: exact methods for solving them are computationally very expensive and thus applicable in practice only to very simple problems. We focus on efficient approximation (heuristic) methods that attempt to alleviate this computational burden and trade off accuracy for speed. We have two objectives. First, we survey various approximation methods, analyze their properties and relations, and provide some new insights into their differences. Second, we present a number of new approximation methods and novel refinements of existing techniques. The theoretical results are supported by experiments on a problem from the agent navigation domain.
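
To make the setting concrete, the following is a purely illustrative Python sketch, not the paper's own algorithms: it shows the belief-state (Bayes filter) update that underlies POMDP planning and the QMDP heuristic, one of the MDP-based value-function approximations that surveys of this kind typically cover. The two-state model, its numbers, and the function names are invented for illustration.

import numpy as np

# Hypothetical 2-state, 2-action, 2-observation POMDP (all numbers invented).
T = np.array([[[0.9, 0.1],   # T[a, s, s'] = P(s' | s, a)
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.5, 0.5]]])
O = np.array([[0.8, 0.2],    # O[s', z] = P(z | s'), assumed action-independent here
              [0.3, 0.7]])
R = np.array([[1.0, 0.0],    # R[a, s] = immediate reward
              [0.0, 1.0]])
gamma = 0.95

def belief_update(b, a, z):
    """Bayes-filter update of the belief state after taking action a and observing z."""
    b_pred = T[a].T @ b          # predicted distribution over next states
    b_new = O[:, z] * b_pred     # weight by observation likelihood
    return b_new / b_new.sum()   # normalize

def qmdp_q_values(n_iter=200):
    """Q-values of the fully observable MDP; QMDP reuses them as alpha-vectors."""
    Q = np.zeros_like(R)
    for _ in range(n_iter):
        V = Q.max(axis=0)        # V(s) = max_a Q(a, s)
        Q = R + gamma * (T @ V)  # Q(a, s) = R(a, s) + gamma * sum_s' T(a, s, s') V(s')
    return Q

def qmdp_action(b, Q):
    """Greedy action that maximizes the expected MDP Q-value under the current belief."""
    return int(np.argmax(Q @ b))

b = np.array([0.5, 0.5])         # start with a uniform belief
Q = qmdp_q_values()
a = qmdp_action(b, Q)
b = belief_update(b, a, z=1)

The QMDP approximation is optimistic: it assumes full observability after one step, so it can undervalue information-gathering actions; the approximation methods studied in work like this aim at tighter bounds on the optimal value function.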
