Recurrent Neural-Linear Posterior Sampling for Nonstationary Contextual Bandits

An agent in a non-stationary contextual bandit problem must balance exploration against the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive way to transform a non-stationary problem into a stationary one that can be solved efficiently. However, even a carefully designed historical context may introduce spurious relationships or lack a convenient representation of crucial information. To address these issues, we propose an approach that learns to represent the relevant context for a decision based solely on the raw history of interactions between the agent and the environment. This approach combines features extracted by a recurrent neural network with a contextual linear bandit algorithm based on posterior sampling. Our experiments on a diverse selection of contextual and non-contextual non-stationary problems show that our recurrent approach consistently outperforms its feedforward counterpart, which requires handcrafted historical contexts, while being more widely applicable than conventional non-stationary bandit algorithms.
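
The combination described above is a neural-linear architecture: a recurrent network compresses the raw interaction history into a feature vector, and per-arm Bayesian linear regression on those features drives posterior (Thompson) sampling over actions. The following sketch illustrates that combination only; it is not the authors' implementation, and the module names, layer sizes, and the ridge/noise hyperparameters are illustrative assumptions.

import numpy as np
import torch
import torch.nn as nn

class RecurrentFeatures(nn.Module):
    # LSTM that maps the raw history of (context, action, reward) triples
    # to a fixed-size feature vector for the linear bandit layer.
    def __init__(self, input_dim, feature_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, feature_dim, batch_first=True)

    def forward(self, history):               # history: (1, T, input_dim)
        _, (h_n, _) = self.lstm(history)
        return h_n[-1, 0]                     # final hidden state: (feature_dim,)

class LinearPosteriorSampling:
    # Per-arm Bayesian linear regression; actions are chosen by sampling a
    # parameter vector from each arm's posterior (linear Thompson sampling).
    def __init__(self, n_arms, feature_dim, ridge=1.0, noise_scale=0.5):
        self.noise_scale = noise_scale
        self.A = [ridge * np.eye(feature_dim) for _ in range(n_arms)]  # design matrices
        self.b = [np.zeros(feature_dim) for _ in range(n_arms)]        # reward-weighted sums

    def act(self, phi):
        scores = []
        for A, b in zip(self.A, self.b):
            mean = np.linalg.solve(A, b)
            cov = self.noise_scale ** 2 * np.linalg.inv(A)
            theta = np.random.multivariate_normal(mean, cov)
            scores.append(float(theta @ phi))
        return int(np.argmax(scores))

    def update(self, arm, phi, reward):
        self.A[arm] += np.outer(phi, phi)
        self.b[arm] += reward * phi

# Illustrative interaction step (environment and history encoding are assumed):
# encoder = RecurrentFeatures(input_dim=8, feature_dim=16)
# bandit = LinearPosteriorSampling(n_arms=4, feature_dim=16)
# phi = encoder(history_tensor).detach().numpy()
# arm = bandit.act(phi)
# bandit.update(arm, phi, reward)

In a setup of this kind, the recurrent encoder would typically be trained to predict observed rewards, while the Bayesian linear layer on top of its features supplies the exploration, so the handcrafted historical context of a feedforward baseline is replaced by a learned one.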
