RL for Latent MDPs: Regret Guarantees and a Lower Bound

In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs). In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least $\Omega((SA)^M)$ episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires only a polynomial number of episodes. We show that the key link is a notion of separation between the dynamics of the constituent MDPs. With sufficient separation, we provide an efficient algorithm with a local guarantee, i.e., a sublinear regret guarantee when the algorithm is given a good initialization. Finally, under standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., Boots et al.) and a reachability assumption, we show that the need for initialization can be removed.
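
To make the setting concrete, below is a minimal sketch (not from the paper) of the interaction protocol the abstract describes: one of $M$ tabular MDPs is drawn according to mixing weights at the start of each episode, and the agent observes only states and rewards, never the index of the drawn MDP. The class name `LatentMDP`, the argument names, and the toy instance at the bottom are all illustrative assumptions, not the authors' code.

```python
import numpy as np

class LatentMDP:
    def __init__(self, transitions, rewards, weights, horizon, rng=None):
        # transitions[m][s][a]: distribution over next states in MDP m;
        # rewards[m][s][a]: mean reward; weights[m]: probability that MDP m is drawn.
        self.P, self.R, self.w = transitions, rewards, weights
        self.H = horizon
        self.rng = rng or np.random.default_rng()

    def run_episode(self, policy, init_state=0):
        m = self.rng.choice(len(self.w), p=self.w)   # latent index, hidden from the agent
        s, history, ret = init_state, [], 0.0
        for t in range(self.H):
            a = policy(history, s, t)                # the agent may condition on the full history
            r = self.R[m][s][a]
            p_next = self.P[m][s][a]
            s = int(self.rng.choice(len(p_next), p=p_next))
            history.append((s, a, r))
            ret += r
        return history, ret                          # the index m is never revealed

if __name__ == "__main__":
    # Two 2-state, 2-action MDPs with identical rewards but different dynamics,
    # mixed with equal weights; a uniformly random policy just to drive the loop.
    P = [
        [[[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.1, 0.9]]],   # MDP 0
        [[[0.1, 0.9], [0.9, 0.1]], [[0.1, 0.9], [0.9, 0.1]]],   # MDP 1
    ]
    R = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]]
    env = LatentMDP(P, R, weights=[0.5, 0.5], horizon=10)
    trajectory, total_reward = env.run_episode(lambda hist, s, t: np.random.randint(2))
    print(total_reward)
```

The point the sketch illustrates is that the latent index is sampled afresh each episode and never returned to the caller, so a learner can only infer it from the observed trajectory; this is what makes separation between the constituent dynamics the key quantity.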

[1] Roman Vershynin et al. Introduction to the non-asymptotic analysis of random matrices, 2010, Compressed Sensing.

[2] Peter Buchholz et al. Computation of weighted sums of rewards for concurrent MDPs, 2018, Math. Methods Oper. Res.

[3] Lihong Li et al. Sample Complexity of Multi-task Reinforcement Learning, 2013, UAI.

[4] O. Cappé et al. On-line expectation–maximization algorithm for latent data models, 2009.

[5] John Langford et al. PAC Reinforcement Learning with Rich Observations, 2016, NIPS.

[6] Constantine Caramanis et al. EM Converges for a Mixture of Many Linear Regressions, 2019, AISTATS.

[7] Brian T. Denton et al. Multi-model Markov decision processes, 2021, IISE Trans.

[8] Nikos A. Vlassis et al. Perseus: Randomized Point-based Value Iteration for POMDPs, 2005, J. Artif. Intell. Res.

[9] Byron Boots et al. An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems, 2011, AAAI.

[10] Nan Jiang et al. Contextual Decision Processes with low Bellman rank are PAC-Learnable, 2016, ICML.

[11] S. Kakade et al. Sample-Efficient Reinforcement Learning of Undercomplete POMDPs, 2020, NeurIPS.

[12] Emma Brunskill et al. A PAC RL Algorithm for Episodic POMDPs, 2016, AISTATS.

[13] Nan Jiang et al. On Oracle-Efficient PAC RL with Rich Observations, 2018, NeurIPS.

[14] Michael L. Littman et al. Memoryless policies: theoretical limitations and practical results, 1994.

[15] Reid G. Simmons et al. Heuristic Search Value Iteration for POMDPs, 2004, UAI.

[16] John N. Tsitsiklis et al. The Complexity of Markov Decision Processes, 1987, Math. Oper. Res.

[17] Nan Jiang et al. Markov Decision Processes with Continuous Side Information, 2017, ALT.

[18] Rémi Munos et al. Minimax Regret Bounds for Reinforcement Learning, 2017, ICML.

[19] Yao Liu et al. PAC Continuous State Online Multitask Reinforcement Learning with Identification, 2016, AAMAS.

[20] Nan Jiang et al. Improving Predictive State Representations via Gradient Descent, 2016, AAAI.

[21] Masoumeh T. Izadi et al. Sensitivity Analysis of POMDP Value Functions, 2009, International Conference on Machine Learning and Applications.

[22] Aurélien Garivier et al. Explore First, Exploit Next: The True Shape of Regret in Bandit Problems, 2016, Math. Oper. Res.

[23] Olivier Buffet et al. MOMDPs: A Solution for Modelling Adaptive Management Problems, 2012, AAAI.

[24] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[25] Anima Anandkumar et al. A Method of Moments for Mixture Models and Hidden Markov Models, 2012, COLT.

[26] Michael R. James et al. Predictive State Representations: A New Theory for Modeling Dynamical Systems, 2004, UAI.

[27] Michael I. Jordan et al. PEGASUS: A policy search method for large MDPs and POMDPs, 2000, UAI.

[28] Shie Mannor et al. Latent Bandits, 2014, ICML.

[29] Constantine Caramanis et al. On the Minimax Optimality of the EM Algorithm for Learning Two-Component Mixed Linear Regression, 2020, AISTATS.

[30] Sergei Vassilvitskii et al. k-means++: the advantages of careful seeding, 2007, SODA '07.

[31] Shie Mannor et al. Contextual Markov Decision Processes, 2015, arXiv.

[32] Hongsheng Xi et al. Finding optimal memoryless policies of POMDPs under the expected average reward criterion, 2011, Eur. J. Oper. Res.

[33] Kamyar Azizzadenesheli et al. Reinforcement Learning of POMDPs using Spectral Methods, 2016, COLT.

[34] Geoffrey J. Gordon et al. Supervised Learning for Dynamical System Learning, 2015, NIPS.

[35] Joelle Pineau et al. Anytime Point-Based Approximations for Large POMDPs, 2006, J. Artif. Intell. Res.

[36] Anima Anandkumar et al. Tensor decompositions for learning latent variable models, 2012, J. Mach. Learn. Res.

[37] Byron Boots et al. Closing the learning-planning loop with predictive state representations, 2009, Int. J. Robotics Res.

[38] Michael I. Jordan et al. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems, 1994, NIPS.

[39] Edward J. Sondik et al. The Optimal Control of Partially Observable Markov Processes over a Finite Horizon, 1973, Oper. Res.

[40] Sham M. Kakade et al. A spectral algorithm for learning Hidden Markov Models, 2008, J. Comput. Syst. Sci.

[41] Richard S. Sutton et al. Predictive Representations of State, 2001, NIPS.

[42] Nan Jiang et al. Provably efficient RL with Rich Observations via Latent State Decoding, 2019, ICML.

[43] Peter Stone et al. Transfer Learning for Reinforcement Learning Domains: A Survey, 2009, J. Mach. Learn. Res.

[44] Peter Auer et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.

[45] Shuai Li et al. Online Clustering of Bandits, 2014, ICML.

[46] Constantine Caramanis et al. The EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians, 2020, COLT.

[47] V. N. Bogaevski et al. Matrix Perturbation Theory, 1991.

[48] Shuai Li et al. On Context-Dependent Clustering of Bandits, 2016, ICML.

[49] Leslie Pack Kaelbling et al. Learning Policies for Partially Observable Environments: Scaling Up, 1997, ICML.