Thompson Sampling for Complex Online Problems

We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a single basic arm in each round. The reward of a complex action is some function of the basic arms' rewards, and the observed feedback need not reveal each arm's individual reward. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Feedback across complex actions is therefore coupled through the structure of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting specified by parameter, action and observation spaces together with a likelihood function over them. The bound holds for discretely supported priors over the parameter space, without additional structural assumptions such as closed-form posteriors, conjugate priors or independence across arms. The regret bound scales logarithmically with time and, more importantly, has an improved constant that nontrivially captures the coupling across complex actions induced by the reward structure. As applications, we derive improved regret bounds for classes of complex bandit problems involving the selection of subsets of arms, including the first nontrivial regret bounds for nonlinear MAX-reward feedback from subsets. Using particle filters to compute posterior distributions that lack an explicit closed form, we present numerical results on the performance of Thompson sampling for subset-selection and job-scheduling problems.
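To make the setting concrete, the following is a minimal sketch (not the paper's exact algorithm) of Thompson sampling for subset selection with MAX feedback, using a simple particle approximation of the posterior over Bernoulli arm parameters; the arm count, subset size, particle count, resampling rule and all variable names below are illustrative assumptions.

    # Thompson sampling with a particle-filter posterior, MAX-of-subset feedback (sketch).
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n_arms, subset_size, n_particles, horizon = 5, 2, 500, 2000
    true_theta = rng.uniform(0.1, 0.9, n_arms)            # unknown Bernoulli arm parameters

    # Particle approximation of the posterior over theta: particles and log-weights.
    particles = rng.uniform(0.0, 1.0, (n_particles, n_arms))
    log_w = np.zeros(n_particles)

    actions = list(itertools.combinations(range(n_arms), subset_size))

    for t in range(horizon):
        # 1. Sample a parameter vector from the (approximate) posterior.
        w = np.exp(log_w - log_w.max()); w /= w.sum()
        theta_hat = particles[rng.choice(n_particles, p=w)]

        # 2. Play the complex action that is optimal under the sampled parameter:
        #    for MAX feedback, E[reward of subset S] = 1 - prod_{i in S} (1 - theta_i).
        best = max(actions, key=lambda S: 1.0 - np.prod(1.0 - theta_hat[list(S)]))

        # 3. Observe only the MAX of the chosen arms' Bernoulli rewards (coupled feedback).
        y = int((rng.random(subset_size) < true_theta[list(best)]).any())

        # 4. Reweight particles by the likelihood of the coupled observation.
        p_max = 1.0 - np.prod(1.0 - particles[:, list(best)], axis=1)
        log_w += np.log(np.clip(p_max if y == 1 else 1.0 - p_max, 1e-12, 1.0))

        # Resample when the effective sample size degenerates.
        w = np.exp(log_w - log_w.max()); w /= w.sum()
        if 1.0 / np.sum(w ** 2) < n_particles / 2:
            idx = rng.choice(n_particles, size=n_particles, p=w)
            particles, log_w = particles[idx], np.zeros(n_particles)

The key point the sketch illustrates is that a single MAX observation carries information about every arm in the chosen subset, so the posterior (and hence Thompson sampling) couples learning across complex actions rather than treating each subset as an independent arm.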
