Detecting Rewards Deterioration in Episodic Reinforcement Learning

In many RL applications, once training ends, it is vital to detect any deterioration in the agent's performance as soon as possible. Moreover, detection often has to be done without modifying the policy and under minimal assumptions about the environment. In this paper, we address this problem by focusing directly on the rewards and testing for their degradation. We consider an episodic framework, in which the rewards within each episode are not independent, identically distributed, or Markovian. We cast this problem as multivariate mean-shift detection with possibly partial observations. We define the mean-shift in a way that corresponds to deterioration of a temporal signal (such as the rewards), and derive a test for this problem with optimal statistical power. Empirically, on deteriorated rewards in control problems (generated using various environment modifications), the test is shown to be more powerful than standard tests, often by orders of magnitude. We also propose a novel bootstrap mechanism for false alarm rate control (BFAR), which applies to episodic (non-i.i.d.) signals and allows our test to run sequentially in an online manner. Our method does not rely on a learned model of the environment, is entirely external to the agent, and can in fact be applied to detect changes or drifts in any episodic signal.
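To make the idea concrete, below is a minimal sketch of the kind of procedure the abstract describes: a one-sided mean-shift statistic over fixed-length episodes of rewards, with an alarm threshold calibrated by resampling whole episodes so that within-episode dependence is preserved. This is an illustrative assumption-laden simplification, not the paper's exact test or BFAR procedure; the function names (`degradation_statistic`, `bootstrap_threshold`), the diagonal covariance weighting, and the parameter choices are all hypothetical.

```python
# Illustrative sketch only: a simplified degradation test for episodic rewards.
# NOT the paper's exact procedure; the statistic (a variance-weighted mean of
# per-timestep reward drops) and the episode-level bootstrap threshold are
# assumptions chosen to mirror the ideas described in the abstract.
import numpy as np

def degradation_statistic(reference, recent):
    """One-sided mean-shift statistic for fixed-length episodes.

    reference: (n_ref, T) array of per-timestep rewards from healthy episodes.
    recent:    (n_new, T) array of recently observed episodes.
    Large positive values indicate reward deterioration.
    """
    mu_ref = reference.mean(axis=0)
    # Per-timestep variance of the mean difference (diagonal approximation).
    var = reference.var(axis=0, ddof=1) * (1.0 / len(reference) + 1.0 / len(recent))
    drop = mu_ref - recent.mean(axis=0)          # positive where rewards fell
    weights = 1.0 / np.maximum(var, 1e-12)       # down-weight noisy timesteps
    return float(np.sum(weights * drop) / np.sqrt(np.sum(weights)))

def bootstrap_threshold(reference, n_new, alpha=0.01, n_boot=2000, seed=None):
    """Estimate the alarm threshold by resampling whole episodes (keeping
    within-episode dependence), so the false-alarm rate is roughly alpha."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(reference), size=len(reference) + n_new)
        fake_ref = reference[idx[:len(reference)]]
        fake_new = reference[idx[len(reference):]]
        stats.append(degradation_statistic(fake_ref, fake_new))
    return float(np.quantile(stats, 1.0 - alpha))

if __name__ == "__main__":
    # Toy usage: alarm when the statistic on new episodes exceeds the threshold.
    rng = np.random.default_rng(0)
    ref = rng.normal(1.0, 0.5, size=(200, 50))   # healthy episodes
    new = rng.normal(0.8, 0.5, size=(20, 50))    # mildly degraded episodes
    thr = bootstrap_threshold(ref, n_new=len(new), alpha=0.01, seed=1)
    print(degradation_statistic(ref, new) > thr)
```

Resampling entire episodes, rather than individual rewards, is the key design choice: it respects the fact that rewards within an episode are neither independent nor identically distributed, which is what lets the threshold be calibrated without an i.i.d. assumption.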
