Learning Finite-State Controllers for Partially Observable Environments

Reactive (memoryless) policies are sufficient in completely observable Markov decision processes (MDPs), but some form of memory is usually necessary for optimal control of a partially observable MDP. Policies with finite memory can be represented as finite-state automata. In this paper, we extend Baird and Moore's VAPS algorithm to the problem of learning general finite-state automata. Because it performs stochastic gradient descent, this algorithm can be shown to converge to a locally optimal finite-state controller. We provide the details of the algorithm and then consider the conditions under which stochastic gradient descent will outperform exact gradient descent. We conclude with empirical results comparing the performance of stochastic and exact gradient descent, and showing that our algorithm can extract the useful information contained in the sequence of past observations to compensate for the lack of observability at each time step.
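
To make the controller representation concrete, here is a minimal, illustrative sketch, not the authors' implementation: a stochastic finite-state controller with softmax-parameterized action probabilities psi(a | memory, observation) and memory-transition probabilities eta(memory' | memory, observation), trained with a plain REINFORCE-style score-function estimator as a simpler stand-in for the VAPS update, using stochastic gradient ascent on the expected discounted return (equivalently, descent on its negation). The LoadUnload toy environment, the two-memory-state controller, and all hyperparameters are assumptions chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

class LoadUnload:
    # Illustrative toy POMDP (an assumption, not necessarily the paper's
    # benchmark): the agent is loaded at the right end of a corridor and is
    # rewarded for carrying the load back to the left end.  All middle
    # positions emit the same observation, so memory is needed to act well.
    N_OBS, N_ACT = 3, 2              # obs: 0=left end, 1=right end, 2=middle

    def __init__(self, length=5):
        self.length = length

    def reset(self):
        self.pos, self.loaded = 0, False
        return self._obs()

    def _obs(self):
        if self.pos == 0:
            return 0
        return 1 if self.pos == self.length - 1 else 2

    def step(self, action):          # action: 0 = move left, 1 = move right
        self.pos = min(max(self.pos + (1 if action == 1 else -1), 0),
                       self.length - 1)
        reward = 0.0
        if self.pos == self.length - 1:
            self.loaded = True
        elif self.pos == 0 and self.loaded:
            self.loaded, reward = False, 1.0
        return self._obs(), reward

def episode_gradient(env, th_act, th_mem, horizon=60, gamma=0.95):
    # Sample one trajectory from the stochastic finite-state controller and
    # return its discounted return plus score-function gradient estimates
    # for the action parameters and the memory-transition parameters.
    g_act, g_mem = np.zeros_like(th_act), np.zeros_like(th_mem)
    obs, mem = env.reset(), 0
    rewards, scores = [], []
    for _ in range(horizon):
        p_a = softmax(th_act[mem, obs])          # psi(a | memory, obs)
        a = rng.choice(env.N_ACT, p=p_a)
        p_m = softmax(th_mem[mem, obs])          # eta(memory' | memory, obs)
        m_next = rng.choice(th_mem.shape[0], p=p_m)
        s_act, s_mem = np.zeros_like(th_act), np.zeros_like(th_mem)
        s_act[mem, obs] = -p_a                   # grad of log softmax prob
        s_act[mem, obs, a] += 1.0
        s_mem[mem, obs] = -p_m
        s_mem[mem, obs, m_next] += 1.0
        obs, r = env.step(a)
        rewards.append(r)
        scores.append((s_act, s_mem))
        mem = m_next
    G = 0.0                                      # discounted reward-to-go
    for t in reversed(range(horizon)):
        G = rewards[t] + gamma * G
        g_act += (gamma ** t) * G * scores[t][0]
        g_mem += (gamma ** t) * G * scores[t][1]
    return G, g_act, g_mem

env = LoadUnload()
n_mem = 2                                        # two internal memory states
th_act = np.zeros((n_mem, env.N_OBS, env.N_ACT)) # logits of psi
th_mem = np.zeros((n_mem, env.N_OBS, n_mem))     # logits of eta
lr = 0.05
for it in range(2000):
    ret, g_act, g_mem = episode_gradient(env, th_act, th_mem)
    th_act += lr * g_act                         # stochastic gradient ascent
    th_mem += lr * g_mem
    if (it + 1) % 500 == 0:
        print(f"episode {it + 1}: sampled discounted return = {ret:.2f}")

Because the gradient is estimated from single sampled trajectories, each update in this sketch is cheap but noisy; this is the stochastic-versus-exact gradient trade-off that the paper examines empirically.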

[1] Ronald A. Howard. Dynamic Programming and Markov Processes, 1960.

[2] Karl Johan Åström. Optimal control of Markov processes with incomplete state information, 1965.

[3] E. J. Sondik. The Optimal Control of Partially Observable Markov Decision Processes, 1971.

[4] J. Satia et al. Markovian Decision Processes with Probabilistic Observation of States, 1973.

[5] Edward J. Sondik et al. The Optimal Control of Partially Observable Markov Processes over a Finite Horizon. Oper. Res., 1973.

[6] Edward J. Sondik. The Optimal Control of Partially Observable Markov Processes over the Infinite Horizon: Discounted Costs. Oper. Res., 1978.

[7] Andrew McCallum. Overcoming Incomplete Perception with Utile Distinction Memory. ICML, 1993.

[8] Leslie Pack Kaelbling et al. Acting Optimally in Partially Observable Stochastic Domains. AAAI, 1994.

[9] Michael L. Littman. Memoryless policies: theoretical limitations and practical results, 1994.

[10] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[11] Michael I. Jordan et al. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems. NIPS, 1994.

[12] Michael I. Jordan et al. Learning Without State-Estimation in Partially Observable Markovian Decision Processes. ICML, 1994.

[13] Leemon C. Baird. Residual Algorithms: Reinforcement Learning with Function Approximation. ICML, 1995.

[14] Andrew W. Moore et al. Reinforcement Learning: A Survey. J. Artif. Intell. Res., 1996.

[15] Andrew McCallum. Reinforcement learning with selective perception and hidden state, 1996.

[16] John N. Tsitsiklis et al. Neuro-Dynamic Programming, 1996.

[17] Jürgen Schmidhuber et al. HQ-Learning. Adapt. Behav., 1997.

[18] Eric A. Hansen. An Improved Policy Iteration Algorithm for Partially Observable MDPs. NIPS, 1997.

[19] Milos Hauskrecht. Planning and control in stochastic domains with imperfect information, 1997.

[20] Leslie Pack Kaelbling et al. Planning and Acting in Partially Observable Stochastic Domains. Artif. Intell., 1998.

[21] Eric A. Hansen. Solving POMDPs by Searching in Policy Space. UAI, 1998.

[22] Andrew W. Moore et al. Gradient Descent for General Reinforcement Learning. NIPS, 1998.

[23] Richard S. Sutton et al. Introduction to Reinforcement Learning, 1998.

[24] A. Cassandra. Exact and approximate algorithms for partially observable Markov decision processes, 1998.

[25] Shlomo Zilberstein et al. Finite-memory control of partially observable systems, 1998.

[26] Kee-Eung Kim et al. Solving POMDPs by Searching the Space of Finite Policies. UAI, 1999.

[27] Leslie Pack Kaelbling et al. Learning Policies with External Memory. ICML, 1999.

[28] Craig Boutilier et al. Decision-Theoretic Planning: Structural Assumptions and Computational Leverage. J. Artif. Intell. Res., 1999.