Stochastic bandits with pathwise constraints

We consider stochastic bandits in which the goal is to maximize reward while satisfying pathwise constraints. The motivation comes from cognitive radio networks, where agents must choose among transmission profiles to maximize throughput under operational constraints such as limited average power. Stochastic bandits are a natural model for an unknown, stationary environment. We propose an algorithm based on a steering approach and analyze its regret with respect to the optimal stationary policy that knows the statistics of the different arms.
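To make the regret benchmark concrete, the following is a minimal sketch of how an optimal stationary policy could be computed when the arm statistics are known. It assumes a single average-cost constraint (for example, mean power at most `budget`), which is a simplification chosen for illustration and not necessarily the paper's exact formulation. The function name `best_stationary_policy` and its parameters are hypothetical. The sketch relies on the standard fact that a linear program with one inequality constraint over the probability simplex attains its optimum by playing a single arm or by mixing two arms so that the constraint is tight. It does not implement the steering algorithm itself.

```python
from itertools import combinations


def best_stationary_policy(rewards, costs, budget):
    """Maximize expected reward subject to expected cost <= budget.

    rewards, costs: known mean reward and mean cost of each arm
    (hypothetical inputs; a learner would have to estimate them).
    Returns (value, mixing_probabilities), or (None, None) if no
    stationary policy satisfies the constraint.
    """
    n = len(rewards)
    best_value, best_mix = None, None

    # Candidate 1: always play one arm whose mean cost fits the budget.
    for i in range(n):
        if costs[i] <= budget:
            if best_value is None or rewards[i] > best_value:
                best_value = rewards[i]
                best_mix = [1.0 if k == i else 0.0 for k in range(n)]

    # Candidate 2: mix a cheap arm and an expensive arm so that the
    # expected cost equals the budget exactly.
    for i, j in combinations(range(n), 2):
        lo, hi = (i, j) if costs[i] <= costs[j] else (j, i)
        if costs[lo] < budget < costs[hi]:
            p_hi = (budget - costs[lo]) / (costs[hi] - costs[lo])
            value = (1 - p_hi) * rewards[lo] + p_hi * rewards[hi]
            if best_value is None or value > best_value:
                best_value = value
                best_mix = [0.0] * n
                best_mix[lo] = 1 - p_hi
                best_mix[hi] = p_hi

    return best_value, best_mix


# Illustrative numbers only: three transmission profiles with made-up
# mean throughput (reward) and mean power (cost).
value, mix = best_stationary_policy(
    rewards=[0.2, 0.5, 0.9], costs=[0.1, 0.4, 1.0], budget=0.5
)
```

In this made-up instance the best policy mixes the second and third arms. That is why a benchmark restricted to a single best arm would be too weak here, and why the comparison is made against the optimal stationary (possibly randomized) policy.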
