论文信息 - Policy-Gradients for PSRs and POMDPs

Policy-Gradients for PSRs and POMDPs

In uncertain and partially observable environments control policies must be a function of the complete history of actions and observations. Rather than present an ever growing history to a learner, we instead track sufficient statistics of the history and map those to a control policy. The mapping has typically been done using dynamic programming, requiring large amounts of memory. We present a general approach to mapping sufficient statistics directly to control policies by combining the tracking of sufficient statistics with the use of policy-gradient reinforcement learning. The best known sufficient statistic is the belief state, computed from a known or estimated partially observable Markov decision process (POMDP) model. More recently, predictive state representations (PSRs) have emerged as a potentially compact model of partially observable systems. Our experiments explore the usefulness of both of these sufficient statistics, exact and estimated, in direct policy-search.

[1] Reid G. Simmons,et al. Probabilistic Robot Navigation in Partially Observable Environments , 1995, IJCAI.

[2] Richard S. Sutton,et al. Using Predictive Representations to Improve Generalization in Reinforcement Learning , 2005, IJCAI.

[3] Peter L. Bartlett,et al. Experiments with Infinite-Horizon, Policy-Gradient Estimation , 2001, J. Artif. Intell. Res..

[4] Richard S. Sutton,et al. Predictive Representations of State , 2001, NIPS.

[5] Michael R. James,et al. Predictive State Representations: A New Theory for Modeling Dynamical Systems , 2004, UAI.

[6] Michael L. Littman,et al. Planning with predictive state representations , 2004, 2004 International Conference on Machine Learning and Applications, 2004. Proceedings..

[7] Michael H. Bowling,et al. Online Discovery and Learning of Predictive State Representations , 2005, NIPS.

[8] Lonnie Chrisman,et al. Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach , 1992, AAAI.

[9] Michael H. Bowling,et al. Learning predictive state representations using non-blind policies , 2006, ICML '06.

[10] Andrew McCallum,et al. Reinforcement learning with selective perception and hidden state , 1996 .

[11] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[12] Stefan Schaal,et al. Natural Actor-Critic , 2003, Neurocomputing.