Policy-Gradient Algorithms for Partially Observable Markov Decision Processes

Partially observable Markov decision processes (POMDPs) are interesting because they can model most conceivable real-world learning problems, for example robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable, which motivates approximate approaches. One such class of algorithms is the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling POMDPs.

In the most general case, POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember.

Monte-Carlo policy-gradient approaches tend to produce gradient estimates with high variance. Two novel methods for reducing variance are introduced. The first uses high-order filters to replace the eligibility trace of the gradient estimator. The second uses a low-variance value-function method to learn a subset of the parameters and a policy-gradient method to learn the remainder.

The algorithms are applied to large domains including a simulated robot navigation scenario, a multi-agent scenario with 21,000 states, and the complex real-world task of large vocabulary continuous speech recognition. To the best of the author’s knowledge, no other policy-gradient algorithms have performed well at such tasks.

The high variance of Monte-Carlo methods requires lengthy simulation and hence a supercomputer to train agents within a reasonable time. The ANU “Bunyip” Linux cluster was built with such tasks in mind and was used for several of the experimental results presented here. One chapter of this thesis describes an application written for the Bunyip cluster that won the international Gordon Bell Prize for price/performance in 2001.
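
To make the idea of a Monte-Carlo policy-gradient estimate with an eligibility trace concrete, the following is a minimal sketch of a generic estimator in the style of GPOMDP. It is not the thesis's own algorithm. The toy world, the softmax policy, and the names `softmax`, `estimate_gradient`, `theta`, and `beta` are all assumptions made for illustration. The policy here is memory-less, so the sketch shows only the gradient estimator, not the learning of internal state.

```python
import math
import random


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def estimate_gradient(theta, steps=50000, beta=0.5, seed=0):
    """Estimate the gradient of the average reward with respect to theta.

    Hypothetical toy world (not from the thesis): a hidden state s in {0, 1}
    is redrawn uniformly at every step, the observation y equals s with
    probability 0.8, and the reward is 1 when the action matches s.
    theta[y][a] is the preference for action a after observing y.
    """
    rng = random.Random(seed)
    n_obs, n_act = 2, 2
    z = [[0.0] * n_act for _ in range(n_obs)]     # eligibility trace
    grad = [[0.0] * n_act for _ in range(n_obs)]  # running mean of r * z

    for t in range(steps):
        s = rng.randrange(2)                      # hidden state
        y = s if rng.random() < 0.8 else 1 - s    # noisy observation
        probs = softmax(theta[y])
        a = 0 if rng.random() < probs[0] else 1   # sample an action
        r = 1.0 if a == s else 0.0

        # The trace is a discounted sum of past score vectors
        # (gradients of log policy), i.e. a first-order filter.
        for o in range(n_obs):
            for b in range(n_act):
                z[o][b] *= beta
        for b in range(n_act):
            z[y][b] += (1.0 if b == a else 0.0) - probs[b]

        # Incremental average of reward times trace.
        for o in range(n_obs):
            for b in range(n_act):
                grad[o][b] += (r * z[o][b] - grad[o][b]) / (t + 1)
    return grad


if __name__ == "__main__":
    uniform_policy = [[0.0, 0.0], [0.0, 0.0]]
    print(estimate_gradient(uniform_policy))
```

In this toy world the analytic gradient at the uniform policy is +0.075 for the action that matches the observation and -0.075 for the other, so the estimate should land near those values. Raising `beta` lengthens the trace, which in general trades bias for variance. The variance-reduction methods described above address exactly that trade-off.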
