The DQN paper was published by DeepMind researchers in 2013. Its first author, Volodymyr Mnih, received his PhD from the University of Toronto and has made outstanding contributions to artificial intelligence, particularly reinforcement learning. The DQN paper was the first to use convolutional neural networks to process raw game frames directly for controlling Atari games.

Table of Contents


  1. Algorithm Introduction
  2. Implementation Tips
  3. A Simple DQN Implementation in PyTorch
  4. Atari DemonAttack Experiment
    1. Sample Results
  5. References

For the basics of reinforcement learning, see 《强化学习之一二》 [2].

Algorithm Introduction

The DQN algorithm is quite simple. The $Q$-value form of the Bellman equation is:

$$ Q^\star (s, a) = \mathbb{E}_{s^\prime \sim \varepsilon}[r + \gamma\max_{a^\prime} Q^\star(s^\prime, a^\prime) \vert s, a] $$

In practice, we usually approximate the $Q$ function with a neural network, i.e. $Q(s, a; \theta) \approx Q^\star (s, a)$, where $\theta$ are the network parameters. The $Q$ network can then be trained with the following loss function:

$$ L_i(\theta_i) = \mathbb{E}_{s, a\sim \rho(\cdot)}\left[ (y_i - Q(s, a; \theta_i))^2 \right] $$

where $y_i = \mathbb{E}_{s^\prime \sim \varepsilon}\left[ r + \gamma \max_{a^\prime} Q(s^\prime, a^\prime; \theta_{i-1}) \vert s, a \right]$ is the target the $Q$ network tries to match at iteration $i$, $\gamma$ is the discount factor, $\theta_{i-1}$ are the (frozen) parameters of the model from the previous iteration, which was used to generate the sample trajectories, and $s, a$ follow the behaviour distribution $\rho(\cdot)$ (in practice, the trajectories we collect).

From this loss function, the gradient with respect to the parameters is:

$$ \nabla_{\theta_i}L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot); s^\prime \sim \varepsilon} \left[ \left(r + \gamma \max_{a^\prime} Q(s^\prime, a^\prime; \theta_{i-1}) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right] $$

In the PyTorch implementation we do not need to compute this gradient by hand: we simply minimize the MSE loss between $y_i$ and $Q(s, a; \theta_i)$.
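
As a concrete illustration, here is a minimal sketch of this update step in PyTorch. The names (`dqn_update`, `model`, `target_model`, and the batch tensors) are hypothetical and not part of the implementation below; the sketch only shows how the TD target and the MSE loss fit together.

import torch
from torch import nn


def dqn_update(model, target_model, optimizer,
               states, actions, rewards, next_states, dones, gamma=0.95):
    # y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}); the bootstrap term
    # is zeroed at terminal states via (1 - dones)
    with torch.no_grad():
        next_q = target_model(next_states).max(dim=-1).values
        targets = rewards + gamma * next_q * (1 - dones)

    # Q(s, a; theta_i) for the actions that were actually taken
    # (`actions` is expected to be an int64 tensor)
    q_pred = model(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    loss = nn.functional.mse_loss(q_pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()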

Implementation Tips

  1. Use an experience replay buffer to improve sample efficiency
  2. Sample $N$ transitions from the buffer at a time to stabilize training

In the implementation below we keep the data of the 32 most recent episodes by default, using a first-in-first-out policy: once the buffer exceeds its capacity, the oldest data is removed. In addition, after each training round we use the newly trained model to update the behaviour model.
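
The FIFO behaviour itself can be sketched with `collections.deque` (an illustration only; the implementation below manages a plain Python list explicitly):

from collections import deque

# a buffer holding at most 32 episodes: appending beyond maxlen
# automatically drops the oldest episode
buffer = deque(maxlen=32)

for episode_id in range(100):
    episode = {'id': episode_id}          # stand-in for an episode's data
    buffer.append(episode)

print(len(buffer))                        # 32
print(buffer[0]['id'], buffer[-1]['id'])  # 68 99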

A Simple DQN Implementation in PyTorch

First, we implement a class to store sample trajectories:

import numpy as np
import torch


class EpisodeData(object):

    def __init__(self):
        self.fields = [
            'states', 'actions', 'rewards', 'dones', 'log_probs', 'next_states'
        ]
        for f in self.fields:
            setattr(self, f, [])
        self.total_rewards = 0

    def add_record(self,
                   state,
                   action,
                   reward,
                   done,
                   log_prob=None,
                   next_state=None):
        self.states.append(state)
        self.actions.append(action)
        self.log_probs.append(log_prob)
        self.dones.append(done)
        self.rewards.append(reward)
        self.next_states.append(next_state)
        self.total_rewards += reward

    def get_states(self):
        return np.array(self.states)

    def get_actions(self):
        return np.array(self.actions)

    def steps(self):
        return len(self.states)

    def calc_qs(self, pre_model, gamma):
        # one-step TD targets: r + gamma * max_a' Q(s', a'; theta_{i-1}),
        # with the bootstrap term masked out at terminal transitions
        next_states = torch.tensor(np.array(self.next_states)).float()
        next_qs = pre_model(next_states).max(dim=-1).values
        masks = torch.tensor(np.array(self.dones) == 0)

        rewards = torch.tensor(np.array(self.rewards)).view(-1)
        qs = rewards + gamma * next_qs * masks

        return qs.detach().float()
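
A quick usage sketch for this class, using random transitions and a plain `nn.Linear` layer standing in for the Q-network (both are placeholders, just to show the call pattern):

import numpy as np
import torch
from torch import nn

data = EpisodeData()
for _ in range(5):
    state = np.random.rand(128).astype(np.float32)
    next_state = np.random.rand(128).astype(np.float32)
    data.add_record(state, action=np.random.randint(6), reward=1.0,
                    done=0, next_state=next_state)

dummy_model = nn.Linear(128, 6)        # stands in for the Q-network
targets = data.calc_qs(dummy_model, gamma=0.95)
print(data.steps(), targets.shape)     # 5 torch.Size([5])
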

Next, we implement the DQN algorithm itself:

from math import ceil

import numpy as np
import torch
from torch import nn, optim

class DQN(object):

    def __init__(self,
                 env,
                 model,
                 lr=1e-5,
                 optimizer='adam',
                 device='cpu',
                 deterministic=False,
                 gamma=0.95,
                 n_replays=4,
                 batch_size=200,
                 model_kwargs=None,
                 exploring=None,
                 n_trained_times=1,
                 n_buffers=32,
                 model_prefix="dqn"):
        self.env = env
        self.model = model
        self.lr = lr
        self.optimizer = optimizer
        self.device = device
        self.deterministic = deterministic
        self.gamma = gamma
        self.n_replays = n_replays
        self.batch_size = batch_size
        self.model_kwargs = model_kwargs
        if optimizer == 'adam':
            self.optimizer = optim.Adam(self.model.parameters(), lr=self.lr)
        elif optimizer == 'sgd':
            self.optimizer = optim.SGD(self.model.parameters(), lr=self.lr)

        self.exploring = exploring
        self.n_trained_times = n_trained_times

        if self.model_kwargs:
            self.pre_model = self.model.__class__(**self.model_kwargs)
        else:
            self.pre_model = self.model.__class__()

        self.data_buffer = []
        self.n_buffers = n_buffers
        self.model_prefix = model_prefix

        self.copy_model()

    def gen_epoch_data(self, n_steps=1024, exploring=0., done_penalty=0):
        state = self.env.reset()
        done = False
        epoch_data = EpisodeData()

        self.model.eval()
        steps = 0

        for _ in range(n_steps):
            steps += 1

            # no gradient is needed when acting in the environment
            with torch.no_grad():
                qs = self.model(torch.tensor(state[np.newaxis, :]).float())

            if exploring and np.random.rand() <= exploring:
                action = self.env.action_space.sample()
            else:
                action = qs[0].argmax().item()

            next_state, reward, done, _ = self.env.step(int(action))
            if done and done_penalty:
                reward -= done_penalty

            epoch_data.add_record(state,
                                  action,
                                  reward,
                                  1 if done else 0,
                                  next_state=next_state)
            state = next_state

            if done:
                state = self.env.reset()

        return epoch_data

    def get_exploring(self, need_exploring=False, mexp=0.1):
        if need_exploring:
            return max(mexp, self.n_trained_times**(-0.5))
        if isinstance(self.exploring, float):
            return self.exploring
        elif self.exploring == 'quadratic_decrease':
            return max(0.01, self.n_trained_times**(-0.5))

        return 0.01

    def copy_model(self):
        self.pre_model.load_state_dict(self.model.state_dict())
        self.pre_model.eval()

    def train(self, epoch_data):
        self.model.train()
        total_loss = 0.
        # targets computed with the frozen model from the previous round
        qs = epoch_data.calc_qs(self.pre_model, gamma=self.gamma).to(self.device)
        states = torch.tensor(epoch_data.get_states()).float().to(self.device)
        actions = torch.tensor(epoch_data.get_actions()[:, np.newaxis]).to(
            self.device)

        n_batches = ceil(len(epoch_data.states) / self.batch_size)
        indices = torch.randperm(len(epoch_data.states)).to(self.device)
        for b in range(n_batches):
            batch_indices = indices[b * self.batch_size:(b + 1) *
                                    self.batch_size]
            batch_states = states[batch_indices]
            batch_actions = actions[batch_indices]
            batch_qs = qs[batch_indices]

            qs_pred = self.model(batch_states).gather(1,
                                                      batch_actions).view(-1)
            loss_func = nn.MSELoss()
            loss = loss_func(batch_qs, qs_pred)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            total_loss += loss.item()

        return total_loss / n_batches

    def learning(self, n_epoches=100, n_steps=1024):
        self.model.train()

        max_reward = -10000.
        decay_reward = 0
        decay = 0.95

        for n in range(n_epoches):
            # generate new data
            new_data = self.gen_epoch_data(n_steps=n_steps,
                                           exploring=self.get_exploring()
                                           if not self.deterministic else 0.)
            self.data_buffer.insert(0, new_data)
            if len(self.data_buffer) > self.n_buffers:
                self.data_buffer = self.data_buffer[:self.n_buffers]

            # training
            for data in self.data_buffer[::-1]:
                loss = self.train(data)

            # update static model
            self.copy_model()

            # show training information
            decay_reward = new_data.total_rewards if decay_reward == 0 else (
                decay_reward * decay + new_data.total_rewards * (1 - decay))

            if max_reward < decay_reward:
                max_reward = decay_reward
                torch.save(self.model.state_dict(),
                           f'./models/{self.model_prefix}-success-v{n}.pt')

            if n % 10 == 0:
                print(
                    f'round: {n:>3d} | loss: {loss:>5.3f} | '
                    f'pre reward: {decay_reward:>5.2f}',
                    flush=True)

Atari DemonAttack Experiment

Training Atari games with reinforcement learning has always been sample-inefficient and takes a long time. Here we train on the DemonAttack-ram-v0 environment from OpenAI Gym.

We generally do not use the raw Atari environment directly; here we make the following adjustments:

  • Frame skipping: Atari games were designed to run at 60 fps, and the game reads the player's input every frame. To speed up training, each input action is usually repeated for several frames; in our code each action is repeated for 8 frames by default.
  • Preventing idling: early in training the model is likely to go a long time without earning any reward, and such reward-free steps contribute nothing to learning. Our implementation therefore caps the number of consecutive steps without reward, 60 by default.
  • Death and idling penalties: when a life is lost or the idling cap is reached, a preset value is subtracted from the reward.
  • The RAM state consists of values in 0~255 by default; we normalize it to 0~1.
  • Rewards are scaled down by a factor of 10 (multiplied by $0.1$) by default.

The final wrapped environment looks like this:

# -*- coding: utf-8 -*-
from gym import Wrapper
import numpy as np


class SkipframeWrapper(Wrapper):

    def __init__(self,
                 env,
                 n_skip=8,
                 n_max_nops=0,
                 done_penalty=50,
                 reward_scale=0.1,
                 lives_penalty=50):
        super().__init__(env)
        self.n_skip = n_skip
        
        # maximum number of consecutive steps without reward (idle cap)
        self.n_max_nops = n_max_nops
        self.n_nops = 0
        
        # penalty applied when the episode ends
        self.done_penalty = done_penalty
        
        # reward scaling factor
        self.reward_scale = reward_scale
        
        # penalty applied when a life is lost
        self.lives_penalty = lives_penalty

    def reset(self):
        self.n_nops = 0
        self.n_pre_lives = None
        # normalize the initial RAM state the same way step() does
        return self.env.reset().astype(np.float32) / 256.

    def step(self, action):
        n = self.n_skip
        total_reward = 0
        current_lives = None
        while n > 0:
            n -= 1
            state, _reward, done, info = self.env.step(action)
            total_reward += _reward
            if 'lives' in info:
                current_lives = info['lives']

            if done:
                break

        if current_lives is not None:
            if self.n_pre_lives is not None and current_lives < self.n_pre_lives:
                total_reward -= self.lives_penalty

            self.n_pre_lives = current_lives

        state = state.astype(np.float32) / 256.

        if total_reward == 0:
            self.n_nops += 1

        if self.n_max_nops and self.n_nops >= self.n_max_nops:
            done = True

        if done:
            total_reward -= self.done_penalty

        total_reward *= self.reward_scale

        return state, total_reward, done, info
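
A usage sketch of the wrapper (assuming `gym` with the Atari ROM for DemonAttack is installed):

import gym

env = SkipframeWrapper(gym.make('DemonAttack-ram-v0'),
                       n_skip=8, n_max_nops=60,
                       done_penalty=50, reward_scale=0.1, lives_penalty=50)

state = env.reset()
state, reward, done, info = env.step(env.action_space.sample())
print(state.shape, reward, done)   # (128,) RAM observation, scaled reward, done flag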

Our model is a simple MLP with three hidden layers:

import numpy as np
import torch
from torch import nn


class DARModel(nn.Module):

    def __init__(self, device='cpu') -> None:
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 6),
        )
        self.device = device
        self._initialize_weights()

    def _initialize_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Conv2d) or isinstance(module, nn.Linear):
                nn.init.normal_(module.weight, 0, 0.05)
                nn.init.normal_(module.bias, 0, 0.1)

    def forward(self, x):
        if isinstance(x, np.ndarray):
            x = torch.tensor(x).float()

        x = x.to(self.device)
        return self.fc(x)
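
Putting the pieces together, a rough end-to-end training run might look like this. The hyperparameters here are illustrative, and a ./models/ directory is assumed to exist for the checkpoints that DQN saves.

import gym

env = SkipframeWrapper(gym.make('DemonAttack-ram-v0'), n_max_nops=60)
model = DARModel(device='cpu')

agent = DQN(env, model,
            lr=1e-4,
            gamma=0.95,
            batch_size=200,
            n_buffers=32,
            exploring='quadratic_decrease',
            model_prefix='dqn-demonattack')

# each epoch collects up to 1024 environment steps, replays the whole buffer,
# then refreshes the behaviour (target) model
agent.learning(n_epoches=100, n_steps=1024)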

Sample Results

Training is very slow; here is the result after a few hours of training (a run that looked reasonably good was picked):

Figure 1: Atari shooter game experiment results

References

[1] Mnih, Volodymyr, et al. "Playing Atari with Deep Reinforcement Learning." arXiv preprint arXiv:1312.5602 (2013).

[2] 强化学习之一二:https://paperexplained.cn/articles/article/sdetail/ed046429-1b20-458f-9483-9089f2ae5acb/

