Zero-Shot Text-to-Image Generation

Abstract

전통적인 Text-to-image generation은 모델구조를 얼마나 잘 짜거나, domain-specific한 걸 이용해서(object part label, segmentation 등을 이용) 이루어졌는데 본 논문에서 autoregressive한 transformer를 기반으로 더 단순한 approach임에도 좋은 결과를 얻을 수 있었다.(물론 데이터는 충분해야함)

dalle_fig1

Method

Transformer 이용하고 Stage 1, Stage 2로 구성

\[\begin{align} \ln p_{\theta, \psi} (x,y) \geq &= \mathbb{E}_{q_\phi(z\lvert x)}[ \ln p_\theta(x\lvert y,z)] - \beta KL(q_\phi(y,z\lvert x) \lvert \lvert p_\psi(y,z)) \\ \end{align}\]

$y$: Captions, $z$: tokens for encoded RGB images. $x$: images

dalle_fig2

Stage 1 : Learning the Visual Codebook

dVAE1 이용하여 256×256 RGB image into a 32 × 32 grid of image tokens으로 압축하고 각 image token은 8192개의 categorical distribution의 outcome으로 되어 있다.
This reduces the context size of the transformer by a factor of $192=8\times 8\times 3$ without a large degradation in visual quality.
$p_\psi$는 vqvae처럼 uniform categorical distributions 고정하여 사용..
the gumbel-softmax relaxation을 이용해서 $q_\phi$를 $q_\phi^\tau$로 relaxation하여 사용. $\tau$를 0으로 보내면 argmax가 됨..
$p_\theta$는 log-laplace distribution을 사용.
$\beta$는 6.6으로 사용, promotes better codebook usage.. 또한 recon error도 낮게 나옴!
Stage 2 : Learning the Prior
We concatenate up to 256 BPE-encoded text tokens with the 32 × 32 = 1024 image tokens.
Train an autoregressive transformer to model the joint distribution over the text and image tokens.

구체적으로

$\phi, \theta$를 구정하고 text와 image에 대한 prior distribution을 학습하였다(with respect to $\psi$)
$p_\psi$는 120억 파라미터 갯수를 갖는 sparse transformer..
text-image pair가 주어졌을 때 text같은 경우는 최대 256개의 토큰 & vocab_size of each token이 16384를 갖도록 lowercased caption에 BPE-encode를 했다.
image같은 경우는 $32\times 32=1024$개의 토큰이며 한 토큰의 vocab_size는 8192인걸로 encode 했다.
image token이 dVAE의 encoder의 logit output으로 argmax sampling을 했다.
Transformer는 decoder only 모델을 썼으며 each image token이 64개의 self-attention layer중에 하나의 layer에서 모든 text-token에 대하여 attention을 계산.
이 모델은 3 종류의 sparse attention masks를 사용하였다. 마지막 Convolutional mask는 마지막 self-attention layer에서만 사용..
Given the index i of a self-attention layer (with $i$ ∈$[1, 63]$), we use the column attention mask if $i − 2$ mod $4 = 0$, and row attention otherwise. E.g., the first four self-attention layers use “row, column, row, row”, respectively.

Conclusion

실제 활용이 가능한 text-to-image generation based on an autoregressive transformer을 개발하였다.
zero-shot performance relative to previous domain-specific approaches / range of capabilities that emerge from a single generative model 측면에서도 좋은결과를 냄.
좋은 일반화 결과를 얻기 위해서 scale을 조절하면 된다.

If you like this post, don’t forget to give me a star.

PREVIOUSNeural Discrete Representation Learning

NEXTGenerative Adversarial Nets

Abstract

Method

Stage 1 : Learning the Visual Codebook

Stage 2 : Learning the Prior

구체적으로

Conclusion