Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm that minimizes the inconsistency between the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multiagent task. The product of experts model performs comparably to table-based Q-learning on small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
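The abstract describes the method only at a high level. The sketch below, assuming a restricted Boltzmann machine form of the product of experts with binary state and action units, illustrates the three ingredients it mentions: Q(s, a) as the negative free energy of the network, a SARSA-style temporal-difference update of the weights, and action selection by Gibbs sampling with the state units clamped. The class name FreeEnergyQ, the hyperparameters, and the single-hidden-layer architecture are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of a free-energy Q-value network:
# a restricted Boltzmann machine over binary state and action units, where
# Q(s, a) is approximated by the negative free energy. Learning uses a
# SARSA-style temporal-difference update; actions are chosen by Gibbs
# sampling over the action units with the state units clamped.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class FreeEnergyQ:
    def __init__(self, n_state, n_action, n_hidden, lr=0.01, gamma=0.95, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_state, n_hidden))   # state-to-hidden weights
        self.U = 0.01 * rng.standard_normal((n_action, n_hidden))  # action-to-hidden weights
        self.b = np.zeros(n_hidden)                                # hidden biases
        self.lr, self.gamma = lr, gamma
        self.rng = rng

    def _hidden_input(self, s, a):
        # Total input to each hidden unit (expert) for configuration (s, a).
        return s @ self.W + a @ self.U + self.b

    def q_value(self, s, a):
        # Free energy F(s, a) = -sum_k log(1 + exp(x_k)); the Q-value is -F.
        x = self._hidden_input(s, a)
        return np.sum(np.logaddexp(0.0, x))

    def sarsa_update(self, s, a, r, s_next, a_next, terminal=False):
        # Temporal-difference error between consecutive state-action pairs.
        target = r if terminal else r + self.gamma * self.q_value(s_next, a_next)
        td = target - self.q_value(s, a)
        # dQ/dW_ik = s_i * p(h_k = 1 | s, a); similarly for U and b.
        h = sigmoid(self._hidden_input(s, a))
        self.W += self.lr * td * np.outer(s, h)
        self.U += self.lr * td * np.outer(a, h)
        self.b += self.lr * td * h

    def sample_action(self, s, n_sweeps=20, temperature=1.0):
        # Gibbs sampling: clamp the state units and alternate between sampling
        # hidden units given (s, a) and action units given the hidden units.
        a = (self.rng.random(self.U.shape[0]) < 0.5).astype(float)
        for _ in range(n_sweeps):
            p_h = sigmoid(self._hidden_input(s, a) / temperature)
            h = (self.rng.random(p_h.shape) < p_h).astype(float)
            p_a = sigmoid((h @ self.U.T) / temperature)
            a = (self.rng.random(p_a.shape) < p_a).astype(float)
        return a

if __name__ == "__main__":
    # Toy usage: 6 binary state features, 4 binary action bits, 16 hidden experts.
    net = FreeEnergyQ(n_state=6, n_action=4, n_hidden=16)
    s = np.array([1, 0, 1, 0, 0, 1], dtype=float)
    a = net.sample_action(s)
    print("sampled action:", a, "Q estimate:", net.q_value(s, a))
```

In this sketch the sampling temperature plays the role of an exploration parameter: a high temperature makes the action distribution close to uniform, while a low temperature concentrates it on actions with low free energy (high estimated Q-value).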
