Contextual Markov Decision Processes

We consider a planning problem in which the dynamics and rewards of the environment depend on a hidden static parameter referred to as the context. The objective is to learn a strategy that maximizes the accumulated reward across all contexts. The new model, called the Contextual Markov Decision Process (CMDP), can capture, for example, a customer's behavior when interacting with a website (the learner): the customer's behavior depends on characteristics such as gender, age, location, and device, and based on the observed behavior, the website aims to infer these characteristics and to optimize the interaction accordingly. Our work focuses on one basic scenario: a finite horizon with a small, known number of possible contexts. We propose a family of algorithms with provable guarantees that learn the underlying models and the latent contexts, and optimize the resulting CMDPs. Bounds are derived for specific naive implementations, and extensions of the framework are discussed, laying the groundwork for future research.
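To make the model concrete, below is a minimal sketch of the CMDP abstraction described above, assuming tabular per-context MDPs over a shared state and action space. All names here (MDP, CMDP, sample_episode) are illustrative choices, not taken from the paper.

```python
# A minimal CMDP sketch: a hidden static context, drawn once per episode,
# selects which per-context MDP generates transitions and rewards.
import numpy as np
from dataclasses import dataclass


@dataclass
class MDP:
    """One per-context model: transitions P[s, a, s'] and rewards R[s, a]."""
    P: np.ndarray  # shape (S, A, S); each P[s, a] is a distribution over next states
    R: np.ndarray  # shape (S, A)


@dataclass
class CMDP:
    """A set of MDPs over shared states/actions; a hidden context picks one."""
    mdps: list                 # one tabular MDP per context
    context_probs: np.ndarray  # prior distribution over contexts

    def sample_episode(self, policy, horizon, s0=0, rng=None):
        """Run one finite-horizon episode. The context is drawn once, stays
        fixed for the whole episode, and is never revealed to the learner."""
        rng = rng or np.random.default_rng()
        c = rng.choice(len(self.mdps), p=self.context_probs)  # hidden context
        mdp, s, total = self.mdps[c], s0, 0.0
        for _ in range(horizon):
            a = policy(s)                                  # learner's action
            total += mdp.R[s, a]
            s = rng.choice(mdp.P.shape[2], p=mdp.P[s, a])  # sample next state
        return total
```

In this sketch the learner only observes states and rewards, so maximizing accumulated reward across contexts requires implicitly identifying the hidden context from the trajectory, which is the learning problem the paper's algorithms address.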
