Internal-State Policy-Gradient Algorithms for Partially Observable Markov Decision Processes

Policy-gradient algorithms are attractive as a scalable approach to learning approximate policies for controlling partially observable Markov decision processes (POMDPs). POMDPs can model a wide variety of learning problems, from robot navigation to speech recognition to stock trading. The downside of this generality is that exact algorithms are computationally intractable, motivating approximate methods. Existing policy-gradient methods have worked well for problems admitting good memoryless solutions, but have failed to scale to large problems requiring memory. This paper develops novel algorithms for learning policies with memory. We present a new policy-gradient algorithm that uses an explicit model of the POMDP to estimate gradients, and demonstrate its effectiveness on problems with tens of thousands of states. We also describe three new Monte-Carlo algorithms that learn internal-state policies purely by interacting with the environment. We compare these algorithms on non-trivial POMDPs, including noisy robot navigation and multi-agent settings.

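To make the idea of an internal-state (memory) policy learned by Monte-Carlo policy-gradient concrete, the following is a minimal sketch only, not the paper's algorithms: a REINFORCE-style likelihood-ratio gradient estimate for a stochastic finite-state controller on a toy two-state POMDP. The toy problem, the controller parameterization, and names such as episode_gradient, theta_a, and theta_m are illustrative assumptions.

```python
# A minimal sketch, NOT the paper's algorithms: a REINFORCE-style
# (likelihood-ratio) gradient estimate for a stochastic finite-state
# controller on a toy two-state POMDP. The toy problem, the controller
# parameterization, and all identifiers here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_OBS, N_ACTIONS, N_MEM = 2, 2, 2, 2   # two internal-memory states

# Toy POMDP: transitions T[s, a, s'], observations O[s, o], rewards R[s, a].
T = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.1, 0.9], [0.9, 0.1]]])
O = np.array([[0.8, 0.2], [0.2, 0.8]])           # noisy observation of the hidden state
R = np.array([[1.0, 0.0], [0.0, 1.0]])           # reward for "matching" the hidden state

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Controller parameters: action logits and memory-transition logits,
# both conditioned on (observation, current internal-memory state).
theta_a = np.zeros((N_OBS, N_MEM, N_ACTIONS))
theta_m = np.zeros((N_OBS, N_MEM, N_MEM))

def episode_gradient(theta_a, theta_m, horizon=50, gamma=0.95):
    """One Monte-Carlo rollout; returns the discounted return and a
    likelihood-ratio gradient estimate (score function times return)."""
    score_a = np.zeros_like(theta_a)
    score_m = np.zeros_like(theta_m)
    s, m, ret, disc = 0, 0, 0.0, 1.0
    for _ in range(horizon):
        o = rng.choice(N_OBS, p=O[s])
        pa = softmax(theta_a[o, m]); a = rng.choice(N_ACTIONS, p=pa)
        pm = softmax(theta_m[o, m]); m_next = rng.choice(N_MEM, p=pm)
        # Accumulate the score functions d/dtheta log pi(a, m' | o, m).
        score_a[o, m] += np.eye(N_ACTIONS)[a] - pa
        score_m[o, m] += np.eye(N_MEM)[m_next] - pm
        ret += disc * R[s, a]
        disc *= gamma
        s = rng.choice(N_STATES, p=T[s, a])
        m = m_next
    return ret, ret * score_a, ret * score_m

# Plain stochastic gradient ascent on the expected discounted return.
lr = 0.05
for _ in range(2000):
    ret, g_a, g_m = episode_gradient(theta_a, theta_m)
    theta_a += lr * g_a
    theta_m += lr * g_m
```

In this sketch the finite memory state plays the role of the learned internal state; a practical estimator would additionally need variance reduction, such as a baseline or discounted eligibility traces as in GPOMDP-style methods.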