Inverse Reinforcement Learning in Contextual MDPs

We consider the Inverse Reinforcement Learning (IRL) problem in Contextual Markov Decision Processes (CMDPs). Here, the reward of the environment, which is not available to the agent, depends on a static parameter referred to as the context. Each context defines an MDP with a different reward signal, and the agent is provided with expert demonstrations for a number of contexts. The goal is to learn a mapping from contexts to rewards such that planning with respect to the induced reward performs similarly to the expert, even for unseen contexts. We propose two learning algorithms for this setting. (1) For rewards that are a linear function of the context, we give a method that is guaranteed to return an $\epsilon$-optimal solution after a polynomial number of demonstrations. (2) For general reward functions, we propose black-box descent methods based on Evolution Strategies that can work with nonlinear estimators (e.g., neural networks). We evaluate our algorithms in autonomous driving and medical treatment simulations and demonstrate their ability to learn and generalize to unseen contexts. A minimal sketch of the Evolution Strategies update underlying the black-box approach is given below.
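To make the black-box approach concrete, the following is a minimal sketch of a generic Evolution Strategies ascent step, not the paper's exact algorithm. Here `score_fn` is a hypothetical placeholder for any routine that maps the parameters of a context-to-reward model to a scalar measure of how closely planning under the induced rewards matches the expert demonstrations across the training contexts.

```python
import numpy as np

def es_update(theta, score_fn, sigma=0.1, lr=0.02, population=50):
    """One Evolution Strategies ascent step on a black-box score function.

    theta      -- current parameter vector (1-D numpy array)
    score_fn   -- maps a parameter vector to a scalar score to be maximized
                  (assumed placeholder: e.g., agreement with expert demonstrations)
    sigma      -- standard deviation of the Gaussian perturbations
    lr         -- step size of the ascent step
    population -- number of antithetic perturbation pairs in the gradient estimate
    """
    rng = np.random.default_rng()
    grad = np.zeros_like(theta)
    for _ in range(population):
        eps = rng.standard_normal(theta.shape)
        # Antithetic sampling: compare scores at theta + sigma*eps and theta - sigma*eps.
        delta = score_fn(theta + sigma * eps) - score_fn(theta - sigma * eps)
        grad += delta * eps
    grad /= 2.0 * sigma * population
    return theta + lr * grad

if __name__ == "__main__":
    # Toy usage: maximize a concave score whose optimum is at theta = 1.
    theta = np.zeros(4)
    score = lambda th: -float(np.sum((th - 1.0) ** 2))
    for _ in range(300):
        theta = es_update(theta, score)
    print(theta)  # approaches [1, 1, 1, 1]
```

Because the update only requires scalar evaluations of `score_fn`, the same scheme applies whether the context-to-reward mapping is linear or a neural network; only the parameter vector `theta` changes.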
