Reinforcement learning for factored Markov decision processes

Learning to act optimally in a complex, dynamic and noisy environment is a hard problem. Various threads of research from reinforcement learning, animal conditioning, operations research, machine learning, statistics and optimal control are beginning to come together to offer solutions to this problem. In this thesis I present novel algorithms for learning the dynamics, learning the value function, and selecting good actions in Markov decision processes. The problems considered have high-dimensional factored state and action spaces, and are either fully or partially observable. The approach I take is to recognize similarities between the problems being solved in the reinforcement learning and graphical models literature, and to use and combine techniques from the two fields in novel ways. In particular I present two new algorithms. First, the DBN algorithm learns a dynamic Bayesian network (DBN) as a compact representation of the core process of a partially observable MDP (POMDP). Because exact inference in the DBN is intractable, I use approximate inference to maintain the belief state, and a belief-state action-value function is learned using reinforcement learning. I show that this DBN algorithm can solve POMDPs with very large state spaces and useful hidden state. Second, the PoE algorithm learns an approximation to value functions over large factored state-action spaces. It represents values as (negative) free energies in a product of experts (PoE) model, whose parameters can be learned efficiently because inference in a product of experts is tractable. I show that good actions can be found even in large factored action spaces by the use of brief Gibbs sampling. These two algorithms take techniques from the machine learning community and apply them in new ways to reinforcement learning problems. Simulation results show that the new methods can be used to solve very large problems. The DBN method is used to solve a POMDP with a hidden state space and an observation space of size greater than 2^180; the DBN model of the core process has 2^32 states, represented as 32 binary variables. The PoE method is used to find actions in action spaces of size 2^40.
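To make the DBN approach concrete, the sketch below shows one simple way a factored belief state over binary variables could be propagated, assuming a fully factored (Boyen-Koller style) approximation in which the belief is just a vector of per-variable marginals. It covers only the prediction step; the names (predict_marginals, parents, cpts) and the data layout are illustrative assumptions, not code from the thesis.

```python
import itertools

import numpy as np


def predict_marginals(belief, parents, cpts, action):
    """One prediction step of a fully factored belief-state approximation.

    belief[i]       : current approximate P(x_i = 1)
    parents[i]      : indices of x_i's parents in the previous time slice
    cpts[i][action] : dict mapping each parent configuration (tuple of 0/1)
                      to P(x_i' = 1 | parent values, action)
    """
    new_belief = np.empty_like(belief)
    for i in range(len(belief)):
        p = 0.0
        for config in itertools.product((0, 1), repeat=len(parents[i])):
            # Probability of this parent configuration under the factored belief.
            w = 1.0
            for j, v in zip(parents[i], config):
                w *= belief[j] if v else 1.0 - belief[j]
            p += w * cpts[i][action][config]
        new_belief[i] = p
    return new_belief
```

An evidence step that folds in the current observation, and reinforcement learning of an action-value function over the resulting marginals, would follow; both are omitted here.

The PoE value representation can be sketched in a similar spirit with a restricted Boltzmann machine over binary state and action variables: the negative free energy plays the role of the action value, and actions are proposed by a few sweeps of Gibbs sampling with the state units clamped. The sizes, initialization and helper names below are again assumptions for illustration, not the thesis's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_state, n_action, n_hidden = 32, 40, 64            # illustrative sizes
W = 0.01 * rng.standard_normal((n_hidden, n_state + n_action))
b_hid = np.zeros(n_hidden)                           # hidden-unit biases
b_vis = np.zeros(n_state + n_action)                 # visible-unit biases


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def free_energy(s, a):
    """F(s, a) of the restricted Boltzmann machine; Q(s, a) is taken to be -F(s, a)."""
    v = np.concatenate([s, a])
    hidden_in = b_hid + W @ v
    return -(b_vis @ v) - np.sum(np.logaddexp(0.0, hidden_in))


def q_value(s, a):
    return -free_energy(s, a)


def sample_action(s, n_sweeps=10):
    """Brief Gibbs sampling over the action units with the state clamped."""
    a = rng.integers(0, 2, size=n_action).astype(float)
    for _ in range(n_sweeps):
        v = np.concatenate([s, a])
        h = (sigmoid(b_hid + W @ v) > rng.random(n_hidden)).astype(float)
        act_in = b_vis[n_state:] + W[:, n_state:].T @ h
        a = (sigmoid(act_in) > rng.random(n_action)).astype(float)
    return a
```

Because the hidden units are conditionally independent given the visible units, the free energy above has a closed form, which is what makes evaluating such a value approximation cheap; a temporal-difference style update on the parameters could then push -F(s, a) toward the reward plus the discounted value of the next state-action pair.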
