Importance sampling for reinforcement learning with multiple objectives

This thesis considers three complications that arise when applying reinforcement learning to a real-world application. In the process of using reinforcement learning to build an adaptive electronic market-maker, we find that the sparsity of data, the partial observability of the domain, and the multiple objectives of the agent cause serious problems for existing reinforcement learning algorithms. We employ importance sampling (likelihood ratios) to achieve good performance in partially observable Markov decision processes with limited data. Our importance sampling estimator requires no knowledge about the environment and places few restrictions on the method of collecting data. It can be used efficiently with reactive controllers, finite-state controllers, or policies with function approximation. We present theoretical analyses of the estimator and incorporate it into a reinforcement learning algorithm. Additionally, the method provides a complete return surface, which can be used to balance multiple objectives dynamically. We demonstrate the need for multiple goals in a variety of applications and present natural solutions based on our sampling method. The thesis concludes with example results from applying our algorithm to the domain of automated electronic market-making.

Thesis Supervisor: Tomaso Poggio
Title: Professor of Brain and Cognitive Science
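As a concrete illustration of the kind of estimator the abstract describes, below is a minimal sketch (not the thesis's exact algorithm) of off-policy return estimation with trajectory likelihood ratios. The `target_policy` and `behavior_policy` callables are hypothetical stand-ins for action-probability functions; because the environment's transition and observation probabilities cancel in the ratio, no model of the environment is required.

```python
# Minimal sketch of off-policy return estimation via importance sampling
# (likelihood ratios).  Assumes trajectories were collected under a known
# behavior policy; the environment dynamics cancel in the ratio and never
# need to be modeled.
import numpy as np

def importance_sampling_return(trajectories, target_policy, behavior_policy,
                               weighted=True):
    """Estimate the expected return of `target_policy` from trajectories
    gathered under `behavior_policy`.

    trajectories    : list of trajectories, each a list of
                      (observation, action, reward) tuples
    target_policy   : callable (observation, action) -> action probability
                      under the policy being evaluated (hypothetical name)
    behavior_policy : callable (observation, action) -> action probability
                      under the data-collecting policy (hypothetical name)
    weighted        : if True, use the weighted (normalized) estimator,
                      which is biased but typically has lower variance
    """
    ratios, returns = [], []
    for trajectory in trajectories:
        ratio = 1.0
        total_reward = 0.0
        for obs, action, reward in trajectory:
            # Trajectory likelihood ratio: only the two policies' action
            # probabilities appear; the unknown dynamics cancel.
            ratio *= target_policy(obs, action) / behavior_policy(obs, action)
            total_reward += reward
        ratios.append(ratio)
        returns.append(total_reward)
    ratios, returns = np.array(ratios), np.array(returns)
    if weighted:
        return np.sum(ratios * returns) / np.sum(ratios)
    return np.mean(ratios * returns)
```

The `weighted=True` branch normalizes by the sum of the ratios rather than the number of trajectories, a common variance-reduction choice for importance sampling estimators; the unweighted branch is the ordinary, unbiased estimator.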
