An Empirical Comparison of Off-policy Prediction Learning Algorithms in the Four Rooms Environment

Many off-policy prediction learning algorithms have been proposed in the past decade, but it remains unclear which algorithms learn faster than others. In this paper, we empirically compare 11 off-policy prediction learning algorithms with linear function approximation on two small tasks: the Rooms task and the High Variance Rooms task. The tasks are designed such that learning fast in them is challenging. In the Rooms task, the product of importance sampling ratios can be as large as 2^14 and can sometimes be two. To control the high variance caused by the product of the importance sampling ratios, the step size should be set small, which in turn slows down learning. The High Variance Rooms task is more extreme in that the product of the ratios can become as large as 2^14 × 25. This paper builds upon the empirical study of off-policy prediction learning algorithms by Ghiassian and Sutton (2021). We consider the same set of algorithms as theirs and employ the same experimental methodology. The algorithms considered are: Off-policy TD(λ), five Gradient-TD algorithms, two Emphatic-TD algorithms, Tree Backup(λ), Vtrace(λ), and ABTD(ζ). We found that the algorithms' performance is highly affected by the variance induced by the importance sampling ratios. The data shows that Tree Backup(λ), Vtrace(λ), and ABTD(ζ) are not affected by the high variance as much as other algorithms, but they restrict the effective bootstrapping parameter in a way that is too limiting for tasks where high variance is not present. We observed that Emphatic TD(λ) tends to have lower asymptotic error than other algorithms, but might learn more slowly in some cases. We suggest algorithms for practitioners based on their problem of interest, and we suggest approaches that can be applied to specific algorithms and might result in substantially improved algorithms.

1 Off-policy Prediction Learning

To learn off-policy is to learn about a target policy while behaving according to a different behavior policy. In prediction learning, the policies are given and fixed. In this paper, we conduct a comparative empirical study of 11 off-policy prediction learning algorithms on two tasks.

Off-policy learning is interesting for many reasons, from learning options (Sutton, Precup, & Singh, 1999) to auxiliary tasks (Jaderberg et al., 2016) and learning from historical data (Thomas, 2015). One interesting use case of off-policy learning is learning about many different policies in parallel (Sutton et al., 2011), which we consider in this paper.

In previous work, many algorithms have been developed for off-policy prediction learning. Off-policy TD(λ) uses importance sampling ratios to correct for the differences between the target and behavior policies, but it is not guaranteed to converge (Precup, Sutton, & Singh, 2000). Later, Gradient-TD and Emphatic-TD algorithms were proposed to guarantee convergence under off-policy training with linear function approximation (Sutton et al., 2009; Sutton, Mahmood, & White, 2016). These convergent algorithms were later developed further to learn faster; Proximal GTD2(λ), TDRC(λ), and Emphatic TD(λ, β) are examples of such algorithms (Mahadevan et al., 2014; Ghiassian et al., 2020; Hallak et al., 2016). Another group of algorithms focuses exclusively on learning fast rather than on convergence; an example of such an algorithm is Vtrace(λ) (Espeholt et al., 2018).
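To make the central object of study concrete, the following is a minimal sketch of one update of Off-policy TD(λ) with linear function approximation and accumulating traces, the baseline algorithm in the comparison. It is illustrative only: the feature vectors, policies, and parameter values are placeholders, and the experiments in this paper do not use this exact code.

```python
import numpy as np

def off_policy_td_lambda_step(w, e, x, x_next, reward, rho, alpha, gamma, lam):
    """One update of Off-policy TD(lambda) with linear function approximation.

    w: weight vector, e: accumulating eligibility trace,
    x, x_next: feature vectors of the current and next state,
    rho: importance sampling ratio pi(A|S) / b(A|S) for the action just taken.
    """
    delta = reward + gamma * w.dot(x_next) - w.dot(x)  # TD error
    e = rho * (gamma * lam * e + x)                    # ratio enters the trace
    w = w + alpha * delta * e
    return w, e

# Illustrative usage with made-up sizes and values.
n_features = 8
w = np.zeros(n_features)
e = np.zeros(n_features)
x, x_next = np.eye(n_features)[0], np.eye(n_features)[1]  # tabular-style features
w, e = off_policy_td_lambda_step(w, e, x, x_next, reward=0.0,
                                 rho=2.0, alpha=0.1, gamma=0.99, lam=0.9)
```

Because the importance sampling ratio multiplies into the eligibility trace at every step, a run of steps with ratios greater than one compounds multiplicatively; this compounding is the source of the variance that the Rooms tasks are designed to stress.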
Off-policy learning has been essential to many of the recent successes of Deep Reinforcement Learning (Deep RL). The DQN architecture (Mnih et al., 2015) and its successors, such as Double DQN (van Hasselt, Guez, & Silver, 2016) and Rainbow (Hessel et al., 2018), rely on off-policy learning. The core of many of these architectures is Q-learning (Watkins, 1989), the first algorithm developed for off-policy control. Recent research has used modern off-policy algorithms such as Vtrace and Emphatic-TD within Deep RL architectures (Espeholt et al., 2018; Jiang et al., 2021), but it remains unclear which of the many off-policy learning algorithms developed to date empirically outperforms the others.

Unfortunately, due to the computational burden, it is impossible to conduct a large comparative study in a complex environment such as the Arcade Learning Environment (ALE). The original DQN agent (Mnih et al., 2015) was trained for one run with a single parameter setting. A detailed comparative study, on the other hand, needs at least 30 runs and typically includes a dozen algorithms, each of which has its own parameters. For example, to compare 10 algorithms on the ALE, each with 100 parameter settings (combinations of the step-size parameter, the bootstrapping parameter, etc.), for 30 runs, we would need 30,000 times more compute than what was used to train the DQN agent on an Atari game. One might think that, given the increase in available compute since 2015, such a study might be feasible. Moore's law states that the available compute approximately doubles every two years, which means that, compared to 2015, roughly eight times more compute is at hand today. Taking this into account, we would still need 30,000/8 = 3,750 times more compute than what was used to train one DQN agent. This is simply not feasible now, or in the foreseeable future.

Let us now examine the possibility of conducting a comparative study in a state-of-the-art domain similar to Atari, but smaller. MinAtar (Young & Tian, 2019) simplifies the ALE environment considerably but presents many of the same challenges. To evaluate the possibility of conducting a comparative study in MinAtar, we compared the training time of two agents: one used the original DQN architecture (Mnih et al., 2015), and the other used the much smaller neural network (NN) architecture that Young and Tian (2019) used for training in MinAtar. Both agents were trained for 30,000 frames on an Intel Xeon Gold 6148, 2.4 GHz CPU core. On average, each MinAtar training frame took 0.003 seconds (0.003s) and each ALE training frame took 0.043s. To speed up training, we repeated the same procedure on an NVidia V100SXM2 (16GB memory) GPU. Each MinAtar training frame took 0.0023s and each ALE training frame took 0.0032s. The GPU did a good job of speeding up the process that used a large NN (in the ALE), but did not provide much of a benefit for the smaller NN used in MinAtar. This means that, assuming we have enough GPUs to train on, using MinAtar and the ALE will not be that different. Given this data, detailed comparative studies in an environment such as MinAtar are still far out of reach.

The most meaningful empirical comparisons have in fact been in small domains. Geist and Scherrer (2014) was the first such study that compared all off-policy prediction learning algorithms to date. Their results were complemented by Dann, Neumann, and Peters (2014) with one extra algorithm and a few new problems. Both studies included algorithms with quadratic and linear computation.
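The compute estimates above are simple arithmetic; the short sketch below merely reproduces them from the figures reported in the text (the frame times are the measurements quoted above, not new data).

```python
# Back-of-the-envelope compute estimate for a full comparative study on the ALE.
num_algorithms = 10       # algorithms to compare
num_param_settings = 100  # step-size parameter, bootstrapping parameter, etc.
num_runs = 30             # independent runs per setting

# Compute needed relative to training a single DQN agent (one run, one setting).
relative_compute = num_algorithms * num_param_settings * num_runs
print(relative_compute)                        # 30000

# Moore's law: compute roughly doubles every two years (2015 onward, ~3 doublings).
moores_law_factor = 2 ** 3
print(relative_compute / moores_law_factor)    # 3750.0

# Measured per-frame training times from the text (seconds).
cpu_times = {"MinAtar": 0.003, "ALE": 0.043}    # Intel Xeon Gold 6148 CPU
gpu_times = {"MinAtar": 0.0023, "ALE": 0.0032}  # NVidia V100SXM2 GPU
for domain in cpu_times:
    print(domain, "GPU speedup:", round(cpu_times[domain] / gpu_times[domain], 1))
```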
White and White (2016) followed with a study on prediction learning algorithms but narrowed the space of algorithms down to the ones with linear computation, which in turn allowed them to go into greater detail in terms of sensitivity to parameters. The study by Ghiassian and Sutton (2021) also focused on linear computation algorithms. They introduced a small task, called the Collision task, applied 11 algorithms to it, explored the parameter space in detail, and studied four extra algorithms. Taking into account final performance, learning speed, and sensitivity to various parameters, Ghiassian and Sutton (2021) grouped the 11 algorithms into three tiers. We will discuss their grouping in more detail later in the paper. This paper conducts a comparative study of off-policy prediction learning algorithms with a focus on the variance issue in off-policy learning. The structure of the paper is similar to Ghiassian and Sutton (2021), and we consider the same algorithms as theirs, but we apply the algorithms to two tasks whose state spaces are 10 times larger than that of the Collision task. The product of importance sampling ratios in the Rooms task is larger than in the Collision task, and the product of the ratios in the High Variance Rooms task is larger than in the Rooms task. We explore the whole parameter space of the algorithms and conclude that the problem variance—the variance induced by the product of the importance sampling ratios—heavily affects algorithm performance.

2 Formal Framework

We simulate the agent-environment interaction using the MDP framework. An agent and an environment interact at discrete time steps, t = 0, 1, . . .. At each time step the environment is in a state St ∈ S and the agent chooses an action At ∈ A according to a behavior policy b : A × S → [0, 1]. For a state and an action (s, a), the probability that action a is taken in state s is denoted by b(a|s), where "|" means that the probability distribution is over a for each s. After choosing an action, the agent receives a numerical reward Rt+1 ∈ R ⊂ ℝ and the environment moves to the next state St+1. The transition from St to St+1 depends on the MDP's transition dynamics. In off-policy learning, the policy the agent learns about is different from the policy the agent uses for behavior. The policy the agent learns about is denoted by π and is termed the target policy, whereas the policy used for behavior is denoted by b and is termed the behavior policy. The goal is to learn the expectation of the sum of future rewards, the return, under the target policy. Both the target and behavior policies are fixed in prediction learning. The return includes a termination function γ : S × A × S → [0, 1]:

Gt ≐ Rt+1 + γ(St, At, St+1)Rt+2 + γ(St, At, St+1)γ(St+1, At+1, St+2)Rt+3 + · · ·

If for some triplet (Sk, Ak, Sk+1) the termination function returns zero, the accumulation of the rewards is terminated. The expectation of the return when starting from a specific state and following a specific policy thereafter is called the value of the state.
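As a small illustration of the return defined above, the following sketch computes Gt from a recorded trajectory with a transition-dependent termination function. The trajectory format and the example termination function are illustrative assumptions, not part of the tasks studied in this paper.

```python
def compute_return(rewards, states, actions, termination):
    """Return G_t for a trajectory segment.

    rewards[k] is R_{t+k+1}, received on the transition
    (states[k], actions[k], states[k+1]);
    termination(s, a, s_next) plays the role of gamma(S, A, S') in [0, 1].
    """
    g, discount = 0.0, 1.0
    for k, r in enumerate(rewards):
        g += discount * r
        discount *= termination(states[k], actions[k], states[k + 1])
        if discount == 0.0:  # accumulation of rewards terminates here
            break
    return g

# Example: constant discounting of 0.9 until a (hypothetical) terminal state "G".
gamma_fn = lambda s, a, s_next: 0.0 if s_next == "G" else 0.9
print(compute_return(rewards=[0.0, 0.0, 1.0],
                     states=["s0", "s1", "s2", "G"],
                     actions=["up", "up", "right"],
                     termination=gamma_fn))  # 0.0 + 0.9*0.0 + 0.81*1.0 = 0.81
```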

References

[1] Doina Precup et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.
[2] Tian Tian et al. MinAtar: An Atari-Inspired Testbed for Thorough and Reproducible Reinforcement Learning Experiments, 2019.
[3] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[4] Pascal Vincent et al. Convergent Tree-Backup and Retrace with Function Approximation, 2017, ICML.
[5] Adam White et al. Gradient Temporal-Difference Learning with Regularized Corrections, 2020, ICML.
[6] Bo Liu et al. Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces, 2014, ArXiv.
[7] Philip S. Thomas et al. Safe Reinforcement Learning, 2015.
[8] David Silver et al. Deep Reinforcement Learning with Double Q-Learning, 2015, AAAI.
[9] Patrick M. Pilarski et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction, 2011, AAMAS.
[10] Ben J. A. Kröse et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.
[11] Martha White et al. Investigating Practical Linear Temporal Difference Learning, 2016, AAMAS.
[12] Richard S. Sutton et al. Multi-step Off-policy Learning Without Importance Sampling Ratios, 2017, ArXiv.
[13] Martha White et al. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning, 2015, J. Mach. Learn. Res.
[14] R. Sutton et al. Gradient temporal-difference learning algorithms, 2011.
[15] Peter Stone et al. Reinforcement learning, 2019, Scholarpedia.
[16] Extending the Sliding-step Technique of Stochastic Gradient Descent to Temporal Difference Learning, 2018.
[17] Doina Precup et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999, Artif. Intell.
[18] Kai Ruggeri et al. Policy evaluation, 2018, Behavioral Insights for Public Policy.
[19] Raphaëlle Branche et al. At the human level, 2019.
[20] Marek Petrik et al. Finite-Sample Analysis of Proximal Gradient TD Algorithms, 2015, UAI.
[21] Marek Petrik et al. Proximal Gradient Temporal Difference Learning Algorithms, 2016, IJCAI.