Policy-Gradient Algorithms for Partially Observable Markov Decision Processes

Partially observable Markov decision processes (POMDPs) are interesting because they can model most conceivable real-world learning problems, for example robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable, which motivates approximate approaches. One such class of algorithms is the so-called policy-gradient methods from reinforcement learning, which adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach to controlling POMDPs. In the most general case, POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies, but have been less successful when memory is required.

This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember. Monte-Carlo policy-gradient approaches tend to produce gradient estimates with high variance, so two novel methods for reducing variance are introduced. The first uses high-order filters to replace the eligibility trace of the gradient estimator. The second uses a low-variance value-function method to learn a subset of the parameters and a policy-gradient method to learn the remainder.

The algorithms are applied to large domains, including a simulated robot-navigation scenario, a multi-agent scenario with 21,000 states, and the complex real-world task of large-vocabulary continuous speech recognition. To the best of the author’s knowledge, no other policy-gradient algorithms have performed well at such tasks.

The high variance of Monte-Carlo methods requires lengthy simulation, and hence a supercomputer, to train agents within a reasonable time. The ANU “Bunyip” Linux cluster was built with such tasks in mind and was used for several of the experimental results presented here. One chapter of this thesis describes an application written for the Bunyip cluster that won the international Gordon Bell prize for price/performance in 2001.
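
The Monte-Carlo estimator underlying these methods can be illustrated with a GPOMDP-style sketch (the infinite-horizon policy-gradient estimator of Baxter and Bartlett) for the simplest, memory-less case. Everything below is a hypothetical illustration, not the thesis's implementation: the toy two-state POMDP, the parameter names, and the constants are chosen only to show how a discounted eligibility trace turns single-trajectory rewards into an estimate of the average-reward gradient.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy POMDP: 2 hidden states, 2 observations, 2 actions.
N_STATES, N_OBS, N_ACTIONS = 2, 2, 2
P = np.array([[[0.9, 0.1], [0.1, 0.9]],     # P[s, a, s']: transition probabilities
              [[0.2, 0.8], [0.8, 0.2]]])
O = np.array([[0.8, 0.2],                   # O[s, o]: observation probabilities
              [0.3, 0.7]])
R = np.array([1.0, -1.0])                   # reward depends only on the hidden state


def policy(theta, obs):
    """Memory-less softmax policy: one row of parameters per observation."""
    z = theta[obs]
    e = np.exp(z - z.max())
    return e / e.sum()


def gpomdp_estimate(theta, beta=0.95, horizon=20_000):
    """Single-trajectory GPOMDP-style estimate of the average-reward gradient.

    beta is the eligibility-trace discount: beta -> 1 lowers the bias of the
    estimate at the cost of higher variance.
    """
    grad = np.zeros_like(theta)
    trace = np.zeros_like(theta)            # eligibility trace z_t
    s = 0
    for t in range(horizon):
        o = rng.choice(N_OBS, p=O[s])
        probs = policy(theta, o)
        a = rng.choice(N_ACTIONS, p=probs)
        g = np.zeros_like(theta)            # gradient of log pi(a | o; theta)
        g[o] = -probs
        g[o, a] += 1.0
        trace = beta * trace + g
        s = rng.choice(N_STATES, p=P[s, a])
        grad += (R[s] * trace - grad) / (t + 1)   # running average of r_{t+1} * z_{t+1}
    return grad


theta = np.zeros((N_OBS, N_ACTIONS))
for step in range(20):                      # crude stochastic gradient ascent
    theta += 0.5 * gpomdp_estimate(theta)

In this parameterisation the trace discount beta trades bias against variance: values close to 1 reduce the bias of the estimate but lengthen the effective memory of the trace and so increase its variance, which is exactly the trade-off that the variance-reduction techniques developed in the thesis address.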
