Acting in Delayed Environments with Non-Stationary Markov Policies

The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it is chosen. However, this assumption is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed only after a delay of m steps. The brute-force baseline of augmenting the state with the last m committed actions suffers from complexity exponential in m, as we show for policy iteration. We then prove that with execution delay, Markov policies in the original state space suffice to attain maximal reward, but must be non-stationary; stationary Markov policies, in contrast, are sub-optimal in general. Consequently, we devise a non-stationary, Q-learning-style, model-based algorithm that solves delayed-execution tasks without resorting to state augmentation. Experiments on tabular, physical, and Atari domains show that it quickly converges to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state augmentation struggle or fail due to divergence. The code will be shared upon publication.
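
To make the execution-delay setting concrete, the sketch below is a minimal illustration (not the paper's algorithm) of how an m-step execution delay can be imposed on a Gym-style environment: an action committed at time t is applied to the underlying environment only at time t + m. The class name DelayedExecutionEnv and the default_action argument are hypothetical placeholders.

```python
from collections import deque


class DelayedExecutionEnv:
    """Wraps a Gym-style environment so each committed action takes effect m steps later.

    Illustrative sketch only; assumes the wrapped env exposes reset() and step(action).
    """

    def __init__(self, env, delay_m, default_action=0):
        self.env = env
        self.delay_m = delay_m
        self.default_action = default_action  # hypothetical filler action for the first m steps
        self.pending = deque()

    def reset(self):
        # Pre-fill the queue so the first m environment steps are well defined.
        self.pending = deque([self.default_action] * self.delay_m)
        return self.env.reset()

    def step(self, action):
        # Commit the new action now; execute the action that was committed m steps ago.
        self.pending.append(action)
        executed = self.pending.popleft()
        return self.env.step(executed)
```

Under this view, the state-augmentation baseline would concatenate the observation with the m pending actions in the queue, so the effective state space grows by a factor of |A|^m; this is the exponential dependence on m referred to above.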
