
## Policy Gradients with Parameter-Based Exploration (PGPE)

### Basic Reinforcement Learning Model

Let $h = (s_1, a_1, \ldots, s_T, a_T)$ denote a history (trajectory) and $r(h)$ its cumulative reward [1]. The objective is the expected reward over histories generated by a policy with parameters $\theta$:

$$J(\theta) = \int_H p(h|\theta)r(h)dh \tag{1}$$

Applying the likelihood-ratio (log-derivative) trick gives the policy gradient:

$$\nabla_\theta J(\theta) = \int_H p(h|\theta) \nabla_\theta \log p(h|\theta) r(h) dh \tag{2}$$

Since $p(h|\theta)$ factorizes into per-step action probabilities $p(a_t|s_t, \theta)$, this becomes

$$\nabla_\theta J(\theta) = \int_H p(h|\theta) \sum_{t=1}^T \nabla_\theta \log p(a_t|s_t, \theta) r(h) dh \tag{3}$$

Sampling $N$ histories yields the REINFORCE-style Monte Carlo estimate:

$$\nabla_\theta J(\theta) \approx \cfrac{1}{N} \sum_{n=1}^N \sum_{t=1}^T \nabla_\theta \log p(a_t^n|s_t^n, \theta) r(h^n) \tag{4}$$
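As a concrete illustration of estimator (4), the sketch below applies it to a one-step bandit with a Gaussian policy, a hypothetical toy setup not taken from the paper: the action is drawn as $a \sim N(\theta, 1)$ and the reward peaks at $a = 3$, so the optimal policy mean is $\theta = 3$.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step "bandit": action a ~ N(theta, 1); reward peaks at a = 3,
# so the optimal policy mean is theta = 3.
def reward(a):
    return -(a - 3.0) ** 2

def reinforce_grad(theta, n_samples=10_000):
    """Monte Carlo estimate of Eq. (4) with T = 1:
    average of grad_theta log p(a|theta) * r over sampled actions."""
    a = rng.normal(theta, 1.0, size=n_samples)
    grad_log_p = a - theta  # d/dtheta of log N(a; theta, 1)
    return np.mean(grad_log_p * reward(a))

theta = 0.0
for _ in range(200):
    theta += 0.05 * reinforce_grad(theta)
print(round(theta, 1))
```

The estimate is unbiased but noisy; the large per-step sample count here hides exactly the variance that PGPE is designed to reduce.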

### PGPE

PGPE tackles the high variance of this estimator by replacing the probabilistic policy with a probability distribution over the policy parameters $\theta$ themselves. Assuming $\theta$ follows a distribution with hyperparameters $\rho$, and writing the (deterministic) controller as $F_\theta$, we have:

$$p(a_t|s_t, \rho) = \int_\theta p(\theta|\rho) \delta_{F_\theta (s_t), a_t} d\theta \tag{5}$$

A further advantage of PGPE is that no backpropagation is needed: the gradient is obtained purely through parameter perturbation, so the method also works with non-differentiable controllers. The expected reward is now an integral over both parameters and histories:

$$J(\rho) = \int_\theta \int_H p(h, \theta|\rho) r(h) dh d\theta \tag{6}$$

$$\nabla_\rho J(\rho) = \int_\theta \int_H p(h, \theta|\rho) \nabla_\rho \log p(h, \theta|\rho) r(h) dh d\theta \tag{7}$$

Since $p(h, \theta|\rho) = p(h|\theta)p(\theta|\rho)$ and $p(h|\theta)$ does not depend on $\rho$, we have $\nabla_\rho \log p(h, \theta|\rho) = \nabla_\rho \log p(\theta|\rho)$, so

$$\nabla_\rho J(\rho) = \int_\theta \int_H p(h|\theta) p(\theta|\rho) \nabla_\rho \log p(\theta|\rho) r(h) dh d\theta \tag{8}$$

Sampling $\theta^n \sim p(\theta|\rho)$ and running a single rollout $h^n$ per sample gives:

$$\nabla_\rho J(\rho) \approx \cfrac{1}{N} \sum_{n=1}^{N} \nabla_\rho \log p(\theta^n|\rho) r(h^n) \tag{9}$$

Taking $p(\theta|\rho)$ to be an independent Gaussian for each parameter, with $\rho = \{\mu_i, \sigma_i\}$, the log-derivatives are:

$$\nabla_{\mu_i} \log p(\theta|\rho) = \cfrac{\theta_i - \mu_i}{\sigma_i^2} \tag{10.1}$$
$$\nabla_{\sigma_i} \log p(\theta | \rho) = \cfrac{(\theta_i - \mu_i)^2 - \sigma_i^2}{\sigma_i^3} \tag{10.2}$$
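The two closed-form log-derivatives can be checked numerically against central finite differences; this is only a sanity check, not part of the algorithm:

```python
import numpy as np

# log-density of N(theta; mu, sigma^2)
def log_p(theta, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (theta - mu) ** 2 / (2 * sigma**2)

theta, mu, sigma, h = 1.3, 0.4, 0.8, 1e-5

d_mu = (theta - mu) / sigma**2                       # Eq. (10.1)
d_sigma = ((theta - mu) ** 2 - sigma**2) / sigma**3  # Eq. (10.2)

# central finite differences
d_mu_fd = (log_p(theta, mu + h, sigma) - log_p(theta, mu - h, sigma)) / (2 * h)
d_sigma_fd = (log_p(theta, mu, sigma + h) - log_p(theta, mu, sigma - h)) / (2 * h)

print(np.isclose(d_mu, d_mu_fd), np.isclose(d_sigma, d_sigma_fd))  # True True
```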

#### Improvement 1: Adding a Baseline

Subtracting a moving-average baseline $b$ of past rewards from $r$ reduces the variance of the updates. With learning rate $\alpha$, the per-parameter updates become:

$$\Delta\mu_i = \alpha (r-b)(\theta_i - \mu_i) \tag{11.1}$$
$$\Delta \sigma_i = \alpha(r-b) \cfrac{(\theta_i - \mu_i)^2 - \sigma_i^2}{\sigma_i} \tag{11.2}$$
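A minimal sketch of the single-sample update with a baseline, on a hypothetical toy task where the "rollout" just scores the parameter vector directly (the `rollout` function, `target`, baseline decay, and clip bounds are all illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy episodic task: reward is highest when the parameters hit `target`.
target = np.array([1.0, -2.0])

def rollout(theta):
    return -np.sum((theta - target) ** 2)

mu, sigma = np.zeros(2), np.ones(2)
alpha, b = 0.05, None

for _ in range(3000):
    theta = rng.normal(mu, sigma)                # one parameter sample per episode
    r = rollout(theta)
    b = r if b is None else 0.9 * b + 0.1 * r    # moving-average baseline
    diff = theta - mu
    mu += alpha * (r - b) * diff                               # Eq. (11.1)
    sigma += alpha * (r - b) * (diff**2 - sigma**2) / sigma    # Eq. (11.2)
    sigma = np.clip(sigma, 0.3, 2.0)             # practical guard, not in the update rule
print(np.round(mu, 1))
```

The clip keeps $\sigma_i$ positive and exploration alive; without some guard a single large $(r-b)$ can drive $\sigma_i$ negative.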

#### Improvement 2: Symmetric Sampling

Instead of a single sample, draw a perturbation $\epsilon$ with $\epsilon_i \sim N(0, \sigma_i^2)$ and evaluate the symmetric pair $\theta^+ = \mu + \epsilon$ and $\theta^- = \mu - \epsilon$, obtaining rewards $r^+$ and $r^-$. The $\mu$ gradient estimate becomes:

$$\nabla_{\mu_i} J(\rho) \approx \cfrac{\epsilon_i (r^+ - r^-)}{2\sigma_i^2} \tag{12}$$

Absorbing the factor $\sigma_i^2$ into the learning rate gives the $\mu$ update:

$$\Delta \mu_i = \cfrac{\alpha \epsilon_i (r^+ - r^-)}{2} \tag{13}$$

The $\sigma$ update uses the mean of the paired rewards against the baseline:

$$\Delta \sigma_i = \alpha \left( \cfrac{r^+ + r^-}{2} -b \right) \left( \cfrac{\epsilon_i^2 - \sigma_i^2}{\sigma_i} \right) \tag{14}$$
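Putting Eqs. (13) and (14) together, one PGPE iteration with symmetric sampling can be sketched on a hypothetical toy task where reward scores the parameter vector directly (task, hyperparameters, and the $\sigma$ floor are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

target = np.array([1.0, -2.0, 0.5])

def rollout(theta):
    return -np.sum((theta - target) ** 2)

mu, sigma = np.zeros(3), np.ones(3)
alpha, b = 0.05, None

for _ in range(1000):
    eps = rng.normal(0.0, sigma)                   # epsilon_i ~ N(0, sigma_i^2)
    r_plus, r_minus = rollout(mu + eps), rollout(mu - eps)
    mean_r = (r_plus + r_minus) / 2
    b = mean_r if b is None else 0.9 * b + 0.1 * mean_r
    mu += alpha * eps * (r_plus - r_minus) / 2                       # Eq. (13)
    sigma += alpha * (mean_r - b) * (eps**2 - sigma**2) / sigma      # Eq. (14)
    sigma = np.clip(sigma, 0.3, 2.0)               # guard against collapse (assumption)
print(np.round(mu, 1))
```

Note that $r^+ - r^-$ cancels any reward offset shared by the pair, which is what makes this estimator lower-variance than the single-sample version.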

#### Improvement 3: Reward Normalization

To make the updates insensitive to the scale of the rewards, normalize them by the distance to the maximum (known or so-far observed) reward $m$:

$$\Delta \mu_i = \cfrac{\alpha \epsilon_i (r^+ - r^-)}{2m - r^+ - r^-} \tag{15.1}$$
$$\Delta \sigma_i = \cfrac{\alpha}{m - b} \left(\cfrac{r^+ + r^-}{2} - b\right)\left( \cfrac{\epsilon_i^2 - \sigma_i^2}{\sigma_i} \right) \tag{15.2}$$
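On a hypothetical toy task whose rewards satisfy $r \le m = 0$, Eq. (15.1) bounds every step by $\alpha|\epsilon_i|$ regardless of the reward scale. A sketch of the normalized $\mu$ update (the $\sigma$ update of Eq. (15.2) is omitted for brevity; task and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: r(theta) = -||theta - target||^2 <= 0,
# so the maximum possible reward is m = 0.
target = np.array([1.0, -2.0, 0.5])

def rollout(theta):
    return -np.sum((theta - target) ** 2)

m = 0.0
mu, sigma, alpha = np.zeros(3), np.ones(3), 0.2

for _ in range(1000):
    eps = rng.normal(0.0, sigma)
    r_plus, r_minus = rollout(mu + eps), rollout(mu - eps)
    # Eq. (15.1): the denominator rescales the step; since r <= m,
    # each component moves by at most alpha * |eps_i|.
    mu += alpha * eps * (r_plus - r_minus) / (2 * m - r_plus - r_minus)
print(np.round(mu, 1))
```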

## References

[1] Sehnke, F., Osendorfer, C., Rückstieß, T., Graves, A., Peters, J., & Schmidhuber, J. (2010). Parameter-exploring policy gradients. Neural Networks, 23(4), 551-559.
