# Zero-Shot Text-to-Image Generation


1. Background
2. Model Overview
    1. Model Training
    2. Training Details
3. References

## Model Overview

1. In the first stage, a discrete VAE (dVAE) encodes each 256×256 image into a 32×32 grid of image tokens, where each token takes one of 8192 possible values.
2. In the second stage, the 256 text tokens are concatenated with the 32×32 = 1024 image tokens, and a transformer models the joint sequence autoregressively.

\begin{align} &\ \ln p_{\theta, \psi}(x, y) \\ =\ & \mathbb{E}_{q_\phi(y, z|x)}\left[\ln p_{\theta, \psi}(x, y)\right] \\ =\ & \mathbb{E}_{q_\phi(y, z|x)}\left[\ln \frac{p_{\theta, \psi}(x, y, z)\, q_\phi(y, z|x)}{p_{\theta, \psi}(z|x, y)\, q_\phi(y, z|x)}\right] \\ =\ & \mathbb{E}_{q_\phi(y, z|x)}\big[\ln p_\theta(x|y, z) + \ln p_\psi(y, z) + \ln q_\phi(y, z|x) \\ &\quad - \ln p_{\theta, \psi}(z|x, y) - \ln q_\phi(y, z|x)\big] \\ =\ & \mathbb{E}_{q_\phi(y, z|x)}\left[\ln p_\theta(x|y, z)\right] \\ & - \mathbb{E}_{q_\phi(y, z|x)}\left[\ln q_\phi(y, z|x) - \ln p_\psi(y, z)\right] \\ & + \mathbb{E}_{q_\phi(y, z|x)}\left[\ln q_\phi(y, z|x) - \ln p_{\theta, \psi}(z|x, y)\right] \\ =\ & \mathbb{E}_{q_\phi(y, z|x)}\left[\ln p_\theta(x|y, z)\right] \\ & - \mathrm{KL}\left(q_\phi(y, z|x)\,\|\,p_\psi(y, z)\right) \\ & + \mathrm{KL}\left(q_\phi(y, z|x)\,\|\,p_{\theta, \psi}(y, z|x)\right) - \mathbb{E}_{q_\phi(y, z|x)}\left[\ln p_{\theta, \psi}(y|x)\right] \\ \geq\ & \mathbb{E}_{q_\phi(y, z|x)}\left[\ln p_\theta(x|y, z)\right] \\ & - \mathrm{KL}\left(q_\phi(y, z|x)\,\|\,p_\psi(y, z)\right) \end{align}
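A shape-level sketch of the two-stage pipeline (variable names and the random token values are illustrative, not from the released code; the 8192-entry image codebook, 256-token text length, and 16384-entry text vocabulary are from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: the dVAE maps a 256x256 image to a 32x32 grid of discrete tokens,
# each drawn from a codebook of 8192 entries.
image_tokens = rng.integers(0, 8192, size=(32, 32))   # placeholder encoder output

# Stage 2: 256 BPE text tokens are concatenated with the flattened
# 1024 image tokens into one sequence for the autoregressive transformer.
text_tokens = rng.integers(0, 16384, size=256)
sequence = np.concatenate([text_tokens, image_tokens.ravel()])

print(sequence.shape)  # (1280,) = 256 text + 32*32 image tokens
```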

### Model Training

#### Gumbel-softmax trick

The reparameterization trick rewrites an expectation over $z \sim p_\phi(z)$ as an expectation over a fixed noise distribution $p(\epsilon)$, with $z = g(\epsilon, \phi)$, so that the gradient with respect to $\phi$ can move inside the expectation:

$$E_{\epsilon \sim p(\epsilon)}[g(\epsilon)] = \int g(\epsilon)\, dF(\epsilon)$$

$$E_{z \sim p_\phi(z)}[f(z)] = E_{\epsilon \sim p(\epsilon)}[f(g(\epsilon, \phi))]$$

$$\nabla_\phi E_{z \sim p_\phi(z)}[f(z)] = E_{\epsilon \sim p(\epsilon)}[\nabla_\phi f(g(\epsilon, \phi))]$$
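The identity above can be checked numerically with the textbook Gaussian example $z = g(\epsilon, \phi) = \mu + \sigma\epsilon$ (a sketch only; the dVAE itself uses the Gumbel-softmax version below):

```python
import numpy as np

rng = np.random.default_rng(0)

# z ~ N(mu, sigma^2) reparameterized as z = mu + sigma * eps, eps ~ N(0, 1)
mu, sigma = 1.5, 0.8
eps = rng.standard_normal(200_000)
z = mu + sigma * eps

# For f(z) = z^2 we have d/dmu E[f(z)] = 2*mu.
# Pathwise (reparameterized) gradient estimate: mean of grad_mu f(g(eps, phi)) = 2*z.
pathwise_grad = (2 * z).mean()

print(pathwise_grad)  # ~ 2 * mu = 3.0
```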

$$P(X = k) = \alpha_k$$

$$\hat{X} = \mathop{\arg\max}_k\, (\log \alpha_k + G_k)$$

where the $G_k$ are i.i.d. standard Gumbel$(0, 1)$ variables. $\hat{X}$ and $X$ are identically distributed, which can be shown directly from the properties of the Gumbel distribution. First let

$$z_k = \log \alpha_k + G_k$$

so that $z_k$ is Gumbel-distributed with location $\log \alpha_k$: its CDF is

$$F(z_k) = e^{-e^{\log \alpha_k - z_k}}$$

and its density is

$$f(z_k) = e^{\log \alpha_k - z_k - e^{\log \alpha_k - z_k}}$$

$$P(z_k \geq z_j,\ \forall j \neq k \mid z_k, \{\alpha_j\}_{j=1}^{K}) = \prod_{j \neq k} P(z_j < z_k) = \prod_{j \neq k} e^{-e^{\log \alpha_j - z_k}}$$

\begin{align} &\ P(z_k \geq z_j,\ \forall j \neq k) \\ =&\ \int_{-\infty}^{\infty} P(z_k \geq z_j,\ \forall j \neq k \mid z_k, \{\alpha_j\}_{j=1}^{K})\, e^{\log \alpha_k - z_k - e^{\log \alpha_k - z_k}}\, dz_k \\ =&\ \int_{-\infty}^{\infty} \prod_{j \neq k} e^{-e^{\log \alpha_j - z_k}}\, e^{\log \alpha_k - z_k - e^{\log \alpha_k - z_k}}\, dz_k \\ =&\ \int_{-\infty}^{\infty} \exp\left( -\Big(\sum_{j=1}^{K} e^{\log \alpha_j}\Big) e^{-z_k} - z_k + \log \alpha_k \right) dz_k \\ =&\ \alpha_k \int_{-\infty}^{\infty} \exp\left( -\sum_{j=1}^{K} \alpha_j e^{-z_k} - z_k \right) dz_k \\ =&\ \alpha_k \int_{-\infty}^{\infty} \exp\left( -\beta e^{-z_k} - z_k \right) dz_k \\ =&\ \alpha_k \int_{-\infty}^{\infty} \exp\left( -e^{\log \beta - z_k} + \log \beta - z_k - \log \beta \right) dz_k \\ =&\ \frac{\alpha_k}{\beta} \int_{-\infty}^{\infty} \exp\left( -e^{\log \beta - z_k} + \log \beta - z_k \right) dz_k \\ =&\ \frac{\alpha_k}{\beta} = \frac{\alpha_k}{\sum_{j=1}^{K} \alpha_j} \end{align}

where $\beta = \sum_{j=1}^{K} \alpha_j$; the final integrand is exactly the density of a Gumbel variable with location $\log \beta$, so the integral equals 1.
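This marginalization can be checked by Monte Carlo: sampling $\hat{X} = \arg\max_k(\log \alpha_k + G_k)$ should reproduce the categorical probabilities $\alpha_k / \sum_j \alpha_j$ (a quick sketch, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([1.0, 2.0, 3.0, 4.0])   # unnormalized class weights
n = 200_000

# Standard Gumbel(0, 1) noise: G = -log(-log(U)), U ~ Uniform(0, 1)
gumbel = -np.log(-np.log(rng.uniform(size=(n, len(alpha)))))
samples = np.argmax(np.log(alpha) + gumbel, axis=1)

empirical = np.bincount(samples, minlength=len(alpha)) / n
print(empirical)   # close to alpha / alpha.sum() = [0.1, 0.2, 0.3, 0.4]
```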

Replacing the hard $\arg\max$ with a temperature-$\tau$ softmax gives the differentiable Gumbel-softmax relaxation:

$$X_k^\tau = \mathop{softmax}_k^{\tau}(\log \alpha_k + G_k) = \frac{e^{(\log \alpha_k + G_k)/\tau}}{\sum_j e^{(\log \alpha_j + G_j)/\tau}}$$
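As $\tau \to 0$ this relaxation approaches the one-hot $\arg\max$ sample, while larger $\tau$ smooths it toward uniform; a small sketch (function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = np.array([1.0, 2.0, 3.0, 4.0])
gumbel = -np.log(-np.log(rng.uniform(size=alpha.shape)))
logits = np.log(alpha) + gumbel   # log(alpha_k) + G_k

def gumbel_softmax(logits, tau):
    # softmax((log alpha_k + G_k) / tau): differentiable relaxation of argmax
    scaled = logits / tau
    scaled -= scaled.max()        # subtract max for numerical stability
    e = np.exp(scaled)
    return e / e.sum()

hard = np.argmax(logits)
for tau in (1.0, 0.1, 0.01):
    print(tau, gumbel_softmax(logits, tau).round(3))  # concentrates on index `hard` as tau shrinks
```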

#### dVAE Training

Like a standard VAE, the dVAE has a two-part loss. One part is the KL divergence between the distribution produced by the encoder and the assumed prior over the latent variable; in the dVAE the prior over each latent is a uniform distribution over the 8192 tokens, so this term is obtained by computing the KL divergence between $enc(x)$ and the uniform distribution. The other part is the reconstruction loss.
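For the KL term, with $q$ the softmax of the encoder logits over $K = 8192$ tokens and a uniform prior $U$, the identity $KL(q\,\|\,U) = \log K - H(q)$ gives a direct computation; a minimal sketch for a single grid position (variable names are illustrative, not from the released code):

```python
import numpy as np

K = 8192
# Stand-in for the enc(x) logits at one of the 32x32 grid positions
logits = np.random.default_rng(0).standard_normal(K)

# q = softmax(logits); KL(q || Uniform(K)) = sum_k q_k * (log q_k - log(1/K))
#                                          = log K - H(q)
q = np.exp(logits - logits.max())
q /= q.sum()
kl_to_uniform = np.log(K) + (q * np.log(q)).sum()

print(kl_to_uniform)   # >= 0, and equals 0 only when q is exactly uniform
```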

## References

[1] Ramesh, Aditya, et al. "Zero-shot text-to-image generation." arXiv preprint arXiv:2102.12092 (2021).

[2] https://openai.com/blog/dall-e/

