Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice

Many approaches to reinforcement learning combine neural networks or other parametric function approximators with a form of temporal-difference learning to estimate the value function of a Markov Decision Process. A significant disadvantage of these procedures is that the resulting learning algorithms are frequently unstable. In this work, we present a new, kernel-based approach to reinforcement learning that overcomes this difficulty and provably converges to a unique solution. In contrast to existing algorithms, our method can also be shown to be consistent in the sense that its costs converge to the optimal costs asymptotically. Our focus is on learning in an average-cost framework and on a practical application to the optimal portfolio choice problem.
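
As a rough illustration of the kind of kernel-based value estimation the abstract refers to, the sketch below combines Nadaraya–Watson kernel smoothing with relative value iteration for an average-cost problem on a continuous state space. It is a minimal sketch under stated assumptions, not the paper's algorithm: the Gaussian kernel, the anchoring of the average cost at a reference state, and all names (`cost_fn`, `sample_next`, `bandwidth`) are illustrative choices supplied here.

```python
import numpy as np

def gaussian_kernel(x, centers, bandwidth):
    """Nadaraya-Watson weights of a query point x against sampled support states."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return w / (w.sum() + 1e-12)

def kernel_relative_value_iteration(states, actions, cost_fn, sample_next,
                                    bandwidth=0.1, n_iter=200):
    """Hypothetical sketch: kernel-smoothed relative value iteration for an
    average-cost MDP.

    states      : (n, d) array of sampled support states
    actions     : iterable of candidate actions
    cost_fn     : cost_fn(x, a) -> one-step cost
    sample_next : sample_next(x, a) -> simulated next state, shape (d,)
    """
    n = len(states)
    h = np.zeros(n)        # relative value function on the sampled states
    rho = 0.0              # estimate of the optimal average cost
    for _ in range(n_iter):
        h_new = np.empty(n)
        for i, x in enumerate(states):
            q_vals = []
            for a in actions:
                x_next = sample_next(x, a)
                # Smooth the current value estimate onto the simulated next state.
                w = gaussian_kernel(x_next, states, bandwidth)
                q_vals.append(cost_fn(x, a) + w @ h)
            h_new[i] = min(q_vals)
        rho = h_new[0]      # anchor the average cost at a reference state
        h = h_new - rho     # keep the relative value function bounded
    return h, rho
```

In a portfolio setting, `states` could be sampled wealth/allocation vectors, `actions` a grid of rebalancing decisions, and `cost_fn` the negative one-period reward; these mappings are assumptions for illustration only.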
