Local Bandit Approximation for Optimal Learning Problems

In general, procedures for determining Bayes-optimal adaptive controls for Markov decision processes (MDPs) require a prohibitive amount of computation; the optimal learning problem is intractable. This paper proposes an approximate approach in which bandit processes are used to model, in a certain "local" sense, a given MDP. Bandit processes constitute an important subclass of MDPs, and their optimal learning strategies (defined in terms of Gittins indices) can be computed relatively efficiently. Thus, one scheme for achieving approximately optimal learning in general MDPs is to take the actions suggested by strategies that are optimal with respect to local bandit models.
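To make the claim about efficient index computation concrete, the following is a minimal sketch of one standard way to compute Gittins indices for a single finite-state bandit process with known transitions: the restart-in-state formulation of Katehakis and Veinott. It is not necessarily the computation used in this paper, and it does not implement the local bandit approximation itself. The function name `gittins_indices` and the parameters `P` (transition matrix as nested lists), `r` (per-state rewards), `beta` (discount factor), `tol`, and `max_iter` are assumptions introduced for illustration.

```python
def gittins_indices(P, r, beta, tol=1e-10, max_iter=100_000):
    """Gittins index of every state of a finite-state bandit process.

    Illustrative sketch using the restart-in-state formulation: for each
    state i, solve an auxiliary problem in which, at every step, the
    controller may either continue from the current state or restart
    from state i. The index of i is (1 - beta) times the optimal value
    of that auxiliary problem evaluated at state i.
    """
    n = len(r)
    indices = []
    for i in range(n):
        V = [0.0] * n  # value function of the restart-in-i problem
        for _ in range(max_iter):
            # Value of continuing from each state j under the current V.
            cont = [
                r[j] + beta * sum(P[j][k] * V[k] for k in range(n))
                for j in range(n)
            ]
            restart = cont[i]  # value of jumping back to state i
            new_V = [max(c, restart) for c in cont]
            diff = max(abs(a - b) for a, b in zip(new_V, V))
            V = new_V
            if diff < tol:  # plain value iteration stopping rule
                break
        indices.append((1.0 - beta) * V[i])
    return indices


if __name__ == "__main__":
    # Hypothetical two-state example: state 0 pays nothing and moves to
    # state 1, which is absorbing and pays 1 per step.
    P = [[0.0, 1.0], [0.0, 1.0]]
    r = [0.0, 1.0]
    print(gittins_indices(P, r, beta=0.9))
```

In this example the index of state 0 comes out close to the discount factor, because the best stopping rule from state 0 is never to stop. In a multi-armed setting, the index strategy plays, at each step, the arm whose current state has the largest index.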
