RL for Latent MDPs: Regret Guarantees and a Lower Bound

In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs). In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least $\Omega((SA)^M)$ episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires only a polynomial number of episodes. We show that the key link is a notion of separation between the dynamics of the constituent MDPs. With sufficient separation, we provide an efficient algorithm with a local guarantee, i.e., a sublinear regret guarantee when the algorithm is given a good initialization. Finally, under standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., Boots et al.) and a reachability assumption, we show that the need for initialization can be removed.
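
To make the setting concrete, below is a minimal sketch (not from the paper) of the interaction protocol the abstract describes: one of $M$ tabular MDPs is drawn according to mixing weights at the start of each episode, and the agent observes only states and rewards, never the index of the drawn MDP. The class name `LatentMDP`, the argument names, and the toy instance at the bottom are all illustrative assumptions, not the authors' code.

```python
import numpy as np

class LatentMDP:
    def __init__(self, transitions, rewards, weights, horizon, rng=None):
        # transitions[m][s][a]: distribution over next states in MDP m;
        # rewards[m][s][a]: mean reward; weights[m]: probability that MDP m is drawn.
        self.P, self.R, self.w = transitions, rewards, weights
        self.H = horizon
        self.rng = rng or np.random.default_rng()

    def run_episode(self, policy, init_state=0):
        m = self.rng.choice(len(self.w), p=self.w)   # latent index, hidden from the agent
        s, history, ret = init_state, [], 0.0
        for t in range(self.H):
            a = policy(history, s, t)                # the agent may condition on the full history
            r = self.R[m][s][a]
            p_next = self.P[m][s][a]
            s = int(self.rng.choice(len(p_next), p=p_next))
            history.append((s, a, r))
            ret += r
        return history, ret                          # the index m is never revealed

if __name__ == "__main__":
    # Two 2-state, 2-action MDPs with identical rewards but different dynamics,
    # mixed with equal weights; a uniformly random policy just to drive the loop.
    P = [
        [[[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.1, 0.9]]],   # MDP 0
        [[[0.1, 0.9], [0.9, 0.1]], [[0.1, 0.9], [0.9, 0.1]]],   # MDP 1
    ]
    R = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]]
    env = LatentMDP(P, R, weights=[0.5, 0.5], horizon=10)
    trajectory, total_reward = env.run_episode(lambda hist, s, t: np.random.randint(2))
    print(total_reward)
```

The point the sketch illustrates is that the latent index is sampled afresh each episode and never returned to the caller, so a learner can only infer it from the observed trajectory; this is what makes separation between the constituent dynamics the key quantity.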

[1] Roman Vershynin et al. Introduction to the non-asymptotic analysis of random matrices, 2010, Compressed Sensing.

[2] Peter Buchholz et al. Computation of weighted sums of rewards for concurrent MDPs, 2018, Math. Methods Oper. Res.

[3] Lihong Li et al. Sample Complexity of Multi-task Reinforcement Learning, 2013, UAI.

[4] O. Cappé et al. On-line expectation–maximization algorithm for latent data models, 2009.

[5] John Langford et al. PAC Reinforcement Learning with Rich Observations, 2016, NIPS.

[6] Constantine Caramanis et al. EM Converges for a Mixture of Many Linear Regressions, 2019, AISTATS.

[7] Brian T. Denton et al. Multi-model Markov decision processes, 2021, IISE Trans.

[8] Nikos A. Vlassis et al. Perseus: Randomized Point-based Value Iteration for POMDPs, 2005, J. Artif. Intell. Res.

[9] Byron Boots et al. An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems, 2011, AAAI.

[10] Nan Jiang et al. Contextual Decision Processes with low Bellman rank are PAC-Learnable, 2016, ICML.

[11] S. Kakade et al. Sample-Efficient Reinforcement Learning of Undercomplete POMDPs, 2020, NeurIPS.

[12] Emma Brunskill et al. A PAC RL Algorithm for Episodic POMDPs, 2016, AISTATS.

[13] Nan Jiang et al. On Oracle-Efficient PAC RL with Rich Observations, 2018, NeurIPS.

[14] Michael L. Littman et al. Memoryless policies: theoretical limitations and practical results, 1994.

[15] Reid G. Simmons et al. Heuristic Search Value Iteration for POMDPs, 2004, UAI.

[16] John N. Tsitsiklis et al. The Complexity of Markov Decision Processes, 1987, Math. Oper. Res.

[17] Nan Jiang et al. Markov Decision Processes with Continuous Side Information, 2017, ALT.

[18] Rémi Munos et al. Minimax Regret Bounds for Reinforcement Learning, 2017, ICML.

[19] Yao Liu et al. PAC Continuous State Online Multitask Reinforcement Learning with Identification, 2016, AAMAS.

[20] Nan Jiang et al. Improving Predictive State Representations via Gradient Descent, 2016, AAAI.

[21] Masoumeh T. Izadi et al. Sensitivity Analysis of POMDP Value Functions, 2009, International Conference on Machine Learning and Applications.

[22] Aurélien Garivier et al. Explore First, Exploit Next: The True Shape of Regret in Bandit Problems, 2016, Math. Oper. Res.

[23] Olivier Buffet et al. MOMDPs: A Solution for Modelling Adaptive Management Problems, 2012, AAAI.

[24] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[25] Anima Anandkumar et al. A Method of Moments for Mixture Models and Hidden Markov Models, 2012, COLT.

[26] Michael R. James et al. Predictive State Representations: A New Theory for Modeling Dynamical Systems, 2004, UAI.

[27] Michael I. Jordan et al. PEGASUS: A policy search method for large MDPs and POMDPs, 2000, UAI.

[28] Shie Mannor et al. Latent Bandits, 2014, ICML.

[29] Constantine Caramanis et al. On the Minimax Optimality of the EM Algorithm for Learning Two-Component Mixed Linear Regression, 2020, AISTATS.

[30] Sergei Vassilvitskii et al. k-means++: the advantages of careful seeding, 2007, SODA '07.

[31] Shie Mannor et al. Contextual Markov Decision Processes, 2015, arXiv.

[32] Hongsheng Xi et al. Finding optimal memoryless policies of POMDPs under the expected average reward criterion, 2011, Eur. J. Oper. Res.

[33] Kamyar Azizzadenesheli et al. Reinforcement Learning of POMDPs using Spectral Methods, 2016, COLT.

[34] Geoffrey J. Gordon et al. Supervised Learning for Dynamical System Learning, 2015, NIPS.

[35] Joelle Pineau et al. Anytime Point-Based Approximations for Large POMDPs, 2006, J. Artif. Intell. Res.

[36] Anima Anandkumar et al. Tensor decompositions for learning latent variable models, 2012, J. Mach. Learn. Res.

[37] Byron Boots et al. Closing the learning-planning loop with predictive state representations, 2009, Int. J. Robotics Res.

[38] Michael I. Jordan et al. Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems, 1994, NIPS.

[39] Edward J. Sondik et al. The Optimal Control of Partially Observable Markov Processes over a Finite Horizon, 1973, Oper. Res.

[40] Sham M. Kakade et al. A spectral algorithm for learning Hidden Markov Models, 2008, J. Comput. Syst. Sci.

[41] Richard S. Sutton et al. Predictive Representations of State, 2001, NIPS.

[42] Nan Jiang et al. Provably efficient RL with Rich Observations via Latent State Decoding, 2019, ICML.

[43] Peter Stone et al. Transfer Learning for Reinforcement Learning Domains: A Survey, 2009, J. Mach. Learn. Res.

[44] Peter Auer et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.

[45] Shuai Li et al. Online Clustering of Bandits, 2014, ICML.

[46] Constantine Caramanis et al. The EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians, 2020, COLT.

[47] V. N. Bogaevski et al. Matrix Perturbation Theory, 1991.

[48] Shuai Li et al. On Context-Dependent Clustering of Bandits, 2016, ICML.

[49] Leslie Pack Kaelbling et al. Learning Policies for Partially Observable Environments: Scaling Up, 1997, ICML.