AttnGAN: Fine-Grained Text to Image Generation
with Attentional Generative Adversarial Networks
Tao Xu∗1, Pengchuan Zhang2, Qiuyuan Huang2, Han Zhang3, Zhe Gan4, Xiaolei Huang1, Xiaodong He2
1Lehigh University  2Microsoft Research  3Rutgers University  4Duke University
{tax313, xih206}@lehigh.edu, {penzhan, qihua, xiaohe}@microsoft.com
han.zhang@cs.rutgers.edu, zhe.gan@duke.edu
Abstract
In this paper, we propose an Attentional Generative Ad-
versarial Network (AttnGAN) that allows attention-driven,
multi-stage refinement for fine-grained text-to-image gener-
ation. With a novel attentional generative network, the At-
tnGAN can synthesize fine-grained details at different sub-
regions of the image by paying attention to the relevant
words in the natural language description. In addition, a
deep attentional multimodal similarity model is proposed to
compute a fine-grained image-text matching loss for train-
ing the generator. The proposed AttnGAN significantly out-
performs the previous state of the art, boosting the best re-
ported inception score by 14.14% on the CUB dataset and
170.25% on the more challenging COCO dataset. A de-
tailed analysis is also performed by visualizing the attention layers of the AttnGAN. For the first time, it shows that the layered attentional GAN is able to automatically select the condition at the word level for generating different parts of the image.
1. Introduction
Automatically generating images according to natural
language descriptions is a fundamental problem in many
applications, such as art generation and computer-aided de-
sign. It also drives research progress in multimodal learning
and inference across vision and language, which is one of
the most active research areas in recent years [20, 18, 31, 19, 4, 29, 5, 1, 30].
Most recently proposed text-to-image synthesis methods
are based on Generative Adversarial Networks (GANs) [6].
A commonly used approach is to encode the whole text de-
scription into a global sentence vector as the condition for
GAN-based image generation [20, 18, 31, 32]. Although
∗Work was performed while the first author was an intern with Microsoft Research.
[Figure 1 input text: "this bird is red with white and has a very short beak"; the two attention rows highlight words such as short, red, beak, white, and bird.]
Figure 1. Example results of the proposed AttnGAN. The first row gives the low-to-high resolution images generated by G0, G1 and G2 of the AttnGAN; the second and third rows show the top-5 most attended words by F^attn_1 and F^attn_2 of the AttnGAN, respectively. Here, images of G0 and G1 are bilinearly upsampled to the same size as that of G2 for better visualization.
impressive results have been presented, conditioning GAN
only on the global sentence vector lacks important fine-
grained information at the word level, and prevents the gen-
eration of high quality images. This problem becomes even
more severe when generating complex scenes such as those
in the COCO dataset [14].
To address this issue, we propose an Attentional Genera-
tive Adversarial Network (AttnGAN) that allows attention-
driven, multi-stage refinement for fine-grained text-to-
image generation. The overall architecture of the AttnGAN
is illustrated in Figure 2. The model consists of two novel
components. The first component is an attentional gener-
ative network, in which an attention mechanism is devel-
oped for the generator to draw different sub-regions of the
image by focusing on words that are most relevant to the
sub-region being drawn (see Figure 1). More specifically,
besides encoding the natural language description into a
global sentence vector, each word in the sentence is also
encoded into a word vector. The generative network uti-
lizes the global sentence vector to generate a low-resolution
image in the first stage. In the following stages, it uses
the image vector in each sub-region to query word vectors
by using an attention layer to form a word-context vector.
It then combines the regional image vector and the corre-
sponding word-context vector to form a multimodal context
vector, based on which the model generates new image fea-
tures in the surrounding sub-regions. This effectively yields
a higher resolution picture with more details at each stage.
The other component in the AttnGAN is a Deep Attentional
Multimodal Similarity Model (DAMSM). With an attention
mechanism, the DAMSM is able to compute the similarity
between the generated image and the sentence using both
the global sentence level information and the fine-grained
word level information. Thus, the DAMSM provides an ad-
ditional fine-grained image-text matching loss for training
the generator.
The contribution of our method is threefold. (i) An
Attentional Generative Adversarial Network is proposed
for synthesizing images from text descriptions. Specif-
ically, two novel components are proposed in the At-
tnGAN, including the attentional generative network and
the DAMSM. (ii) Comprehensive study is carried out to em-
pirically evaluate the proposed AttnGAN. Experimental re-
sults show that the AttnGAN significantly outperforms pre-
vious state-of-the-art GAN models. (iii) A detailed analy-
sis is performed through visualizing the attention layers of
the AttnGAN. For the first time, it is demonstrated that the
layered conditional GAN is able to automatically attend to
relevant words to form the condition for image generation.
2. Related Work
Generating high resolution images from text descrip-
tions, though very challenging, is important for many prac-
tical applications such as art generation and computer-
aided design. Recently, great progress has been achieved
in this direction with the emergence of deep generative
models [12, 26, 6]. Mansimov et al. [15] built the align-
DRAW model, extending the Deep Recurrent Attention
Writer (DRAW) [7] to iteratively draw image patches while
attending to the relevant words in the caption. Nguyen
et al. [16] proposed an approximate Langevin approach
to generate images from captions. Reed et al. [21] used
conditional PixelCNN [26] to synthesize images from text
with a multi-scale model structure. Compared with other
deep generative models, Generative Adversarial Networks
(GANs) [6] have shown great performance for generating
sharper samples [17, 3, 23, 13, 10]. Reed et al. [20] first
showed that the conditional GAN was capable of synthesiz-
ing plausible images from text descriptions. Their follow-
up work [18] also demonstrated that GAN was able to gen-
erate better samples by incorporating additional conditions
(e.g., object locations). Zhang et al. [31, 32] stacked sev-
eral GANs for text-to-image synthesis and used different
GANs to generate images of different sizes. However, all
of their GANs are conditioned on the global sentence vec-
tor, missing fine-grained word level information for image
generation.
The attention mechanism has recently become an inte-
gral part of sequence transduction models. It has been suc-
cessfully used in modeling multi-level dependencies in im-
age captioning [29], image question answering [30] and
machine translation [2]. Vaswani et al. [27] also demon-
strated that machine translation models could achieve state-
of-the-art results by solely using an attention model. Despite this progress, the attention mechanism has not been explored in GANs for text-to-image synthesis yet. It is
worth mentioning that the alignDRAW [15] also used LAP-
GAN [3] to scale the image to a higher resolution. How-
ever, the GAN in their framework was only utilized as a
post-processing step without attention. To our knowledge,
the proposed AttnGAN for the first time develops an atten-
tion mechanism that enables GANs to generate fine-grained
high quality images via multi-level (e.g., word level and
sentence level) conditioning.
3. Attentional Generative Adversarial Network
As shown in Figure 2, the proposed Attentional Gener-
ative Adversarial Network (AttnGAN) has two novel com-
ponents: the attentional generative network and the deep
attentional multimodal similarity model. We will elaborate
each of them in the rest of this section.
3.1. Attentional Generative Network
Current GAN-based models for text-to-image genera-
tion [20, 18, 31, 32] typically encode the whole-sentence
text description into a single vector as the condition for im-
age generation, but lack fine-grained word level informa-
tion. In this section, we propose a novel attention model
that enables the generative network to draw different sub-
regions of the image conditioned on words that are most
relevant to those sub-regions.
As shown in Figure 2, the proposed attentional genera-
tive network has m generators (G0, G1, ..., Gm−1), which
take the hidden states (h0, h1, ..., hm−1) as input and gen-
erate images of small-to-large scales (ˆx0, ˆx1, ..., ˆxm−1).
[Figure 2 components: a text encoder producing the sentence feature and word features; F^ca (Conditioning Augmentation) and F_0 with z ~ N(0, I); attention models F^attn_1 and F^attn_2; stages F_1, F_2 and generators G_0, G_1, G_2 producing 64×64×3, 128×128×3, and 256×256×3 images with discriminators D_0, D_1, D_2; an image encoder extracting local image features for the DAMSM. Building blocks include FC with reshape, upsampling, joining, residual, and Conv3x3 layers.]
Figure 2. The architecture of the proposed AttnGAN. Each attention model automatically retrieves the conditions (i.e., the most relevant
word vectors) for generating different sub-regions of the image; the DAMSM provides the fine-grained image-text matching loss for the
generative network.
Specifically,

    h_0 = F_0(z, F^ca(ē));
    h_i = F_i(h_{i−1}, F^attn_i(e, h_{i−1})) for i = 1, 2, ..., m − 1;    (1)
    x̂_i = G_i(h_i).

Here, z is a noise vector usually sampled from a standard normal distribution, ē is the global sentence vector, and e is the matrix of word vectors. F^ca represents the Conditioning Augmentation [31] that converts the sentence vector ē to the conditioning vector. F^attn_i is the proposed attention model at the i-th stage of the AttnGAN. F^ca, F^attn_i, F_i, and G_i are modeled as neural networks.
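To make Eq. (1) concrete, the following PyTorch-style sketch (our illustration, not the authors' released code; F_ca, F0, F_attn, F, and G are placeholder modules standing in for the networks described above) shows how the hidden states and images could be produced stage by stage.

```python
import torch

def generate_images(z, sent_emb, word_embs, F_ca, F0, F_attn, F, G):
    """Sketch of Eq. (1): multi-stage attentional generation.

    z:         (batch, z_dim) noise vector
    sent_emb:  (batch, D) global sentence vector (e-bar)
    word_embs: (batch, D, T) word feature matrix (e)
    F_ca, F0, F_attn[i], F[i], G[i] are assumed neural-network modules.
    """
    c_code = F_ca(sent_emb)                        # Conditioning Augmentation
    h = F0(z, c_code)                              # h_0
    images = [G[0](h)]                             # x_hat_0, the low-resolution image
    for i in range(1, len(G)):                     # stages 1 .. m-1
        word_context = F_attn[i](word_embs, h)     # F_i^attn(e, h_{i-1})
        h = F[i](h, word_context)                  # h_i
        images.append(G[i](h))                     # x_hat_i at a larger scale
    return images
```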
The attention model F^attn(e, h) has two inputs: the word features e ∈ R^{D×T} and the image features from the previous hidden layer h ∈ R^{D̂×N}. The word features are first converted into the common semantic space of the image features by adding a new perceptron layer, i.e., e′ = Ue, where U ∈ R^{D̂×D}. Then, a word-context vector is computed for each sub-region of the image based on its hidden features h (query). Each column of h is a feature vector of a sub-region of the image. For the j-th sub-region, its word-context vector is a dynamic representation of word vectors relevant to h_j, which is calculated by

    c_j = Σ_{i=0}^{T−1} β_{j,i} e′_i,  where  β_{j,i} = exp(s′_{j,i}) / Σ_{k=0}^{T−1} exp(s′_{j,k}),    (2)

s′_{j,i} = h_j^T e′_i, and β_{j,i} indicates the weight the model attends to the i-th word when generating the j-th sub-region of the image. We then denote the word-context matrix for image feature set h by F^attn(e, h) = (c_0, c_1, ..., c_{N−1}) ∈ R^{D̂×N}.
Finally, image features and the corresponding word-context
features are combined to generate images at the next stage.
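A minimal sketch of the attention model F^attn in Eq. (2) is given below, assuming the perceptron U is implemented as a linear layer and the inputs follow the shapes defined above; it is an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    """Minimal sketch of F^attn in Eq. (2); dimensions follow Section 3.1."""
    def __init__(self, word_dim, img_dim):
        super().__init__()
        self.project = nn.Linear(word_dim, img_dim, bias=False)   # U in e' = U e

    def forward(self, words, hidden):
        # words:  (batch, D, T)      word features e
        # hidden: (batch, D_hat, N)  sub-region features h, one column per sub-region
        e_proj = self.project(words.transpose(1, 2))               # (batch, T, D_hat)
        scores = torch.bmm(hidden.transpose(1, 2),
                           e_proj.transpose(1, 2))                 # s'_{j,i} = h_j^T e'_i -> (batch, N, T)
        beta = F.softmax(scores, dim=2)                            # attention over words per sub-region
        context = torch.bmm(beta, e_proj)                          # c_j = sum_i beta_{j,i} e'_i -> (batch, N, D_hat)
        return context.transpose(1, 2)                             # word-context matrix (batch, D_hat, N)
```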
To generate realistic images with multiple levels (i.e.,
sentence level and word level) of conditions, the final objec-
tive function of the attentional generative network is defined
as
    L = L_G + λ L_DAMSM,  where  L_G = Σ_{i=0}^{m−1} L_{G_i}.    (3)
Here, λ is a hyperparameter to balance the two terms of Eq. (3). The first term is the GAN loss that jointly approximates conditional and unconditional distributions [32]. At the i-th stage of the AttnGAN, the generator G_i has a corresponding discriminator D_i. The adversarial loss for G_i is defined as

    L_{G_i} = −(1/2) E_{x̂_i∼p_{G_i}} [log D_i(x̂_i)]      (unconditional loss)
              −(1/2) E_{x̂_i∼p_{G_i}} [log D_i(x̂_i, ē)],   (conditional loss)    (4)
where the unconditional loss determines whether the image
is real or fake while the conditional loss determines whether
the image and the sentence match or not.
In alternation with the training of G_i, each discriminator D_i is trained to classify its input as real or fake by minimizing the cross-entropy loss defined by

    L_{D_i} = −(1/2) E_{x_i∼p_{data_i}} [log D_i(x_i)] − (1/2) E_{x̂_i∼p_{G_i}} [log(1 − D_i(x̂_i))]        (unconditional loss)
              −(1/2) E_{x_i∼p_{data_i}} [log D_i(x_i, ē)] − (1/2) E_{x̂_i∼p_{G_i}} [log(1 − D_i(x̂_i, ē))],   (conditional loss)    (5)

where x_i is from the true image distribution p_{data_i} at the i-th scale, and x̂_i is from the model distribution p_{G_i} at the same scale. Discriminators of the AttnGAN are structurally disjoint, so they can be trained in parallel and each of them focuses on a single image scale.
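The two losses in Eqs. (4) and (5) can be sketched as follows, assuming each discriminator D_i outputs probabilities and accepts an optional sentence vector for its conditional branch (an assumed interface for illustration, not the exact one used by the authors).

```python
import torch
import torch.nn.functional as F

def generator_loss(D_i, fake_img, sent_emb):
    """Sketch of Eq. (4): unconditional + conditional adversarial loss for G_i."""
    uncond = D_i(fake_img)                         # D_i(x_hat_i)
    cond = D_i(fake_img, sent_emb)                 # D_i(x_hat_i, e_bar)
    real_labels = torch.ones_like(uncond)
    return 0.5 * (F.binary_cross_entropy(uncond, real_labels) +
                  F.binary_cross_entropy(cond, real_labels))

def discriminator_loss(D_i, real_img, fake_img, sent_emb):
    """Sketch of Eq. (5): D_i classifies real vs. fake, with and without the sentence."""
    real_u, fake_u = D_i(real_img), D_i(fake_img.detach())
    real_c, fake_c = D_i(real_img, sent_emb), D_i(fake_img.detach(), sent_emb)
    ones, zeros = torch.ones_like(real_u), torch.zeros_like(fake_u)
    uncond = F.binary_cross_entropy(real_u, ones) + F.binary_cross_entropy(fake_u, zeros)
    cond = F.binary_cross_entropy(real_c, ones) + F.binary_cross_entropy(fake_c, zeros)
    return 0.5 * (uncond + cond)
```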
The second term of Eq. (3), LDAMSM , is a word level
fine-grained image-text matching loss computed by the
DAMSM, which will be elaborated in Subsection 3.2.
3.2. Deep Attentional Multimodal Similarity Model
The DAMSM learns two neural networks that map sub-
regions of the image and words of the sentence to a common
semantic space, thus measures the image-text similarity at
the word level to compute a fine-grained loss for image gen-
eration.
The text encoder is a bi-directional Long Short-Term
Memory (LSTM) [24] that extracts semantic vectors from
the text description. In the bi-directional LSTM, each word
corresponds to two hidden states, one for each direction.
Thus, we concatenate its two hidden states to represent the
semantic meaning of a word. The feature matrix of all words is indicated by e ∈ R^{D×T}. Its i-th column e_i is the feature vector for the i-th word, D is the dimension of the word vector, and T is the number of words. Meanwhile, the last hidden states of the bi-directional LSTM are concatenated to form the global sentence vector, denoted by ē ∈ R^D.
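A minimal sketch of such a text encoder is shown below, assuming a single-layer bi-directional LSTM with hidden size D/2 per direction so that the concatenated states have dimension D; it illustrates the interface rather than the authors' exact encoder.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the bi-directional LSTM text encoder of Section 3.2."""
    def __init__(self, vocab_size, embed_dim, D):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, D // 2, batch_first=True, bidirectional=True)

    def forward(self, captions):
        # captions: (batch, T) word indices
        out, (h_n, _) = self.lstm(self.embed(captions))
        word_features = out.transpose(1, 2)                # e: (batch, D, T), both directions concatenated
        sent_feature = torch.cat([h_n[0], h_n[1]], dim=1)  # e_bar: (batch, D), last states of both directions
        return word_features, sent_feature
```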
The image encoder is a Convolutional Neural Network
(CNN) that maps images to semantic vectors. The inter-
mediate layers of the CNN learn local features of different
sub-regions of the image, while the later layers learn global
features of the image. More specifically, our image en-
coder is built upon the Inception-v3 model [25] pretrained
on ImageNet [22]. We first rescale the input image to 299×299 pixels. Then, we extract the local feature matrix f ∈ R^{768×289} (reshaped from 768×17×17) from the "mixed_6e" layer of Inception-v3. Each column of f is the feature vector of a sub-region of the image; 768 is the dimension of the local feature vector, and 289 is the number of sub-regions in the image. Meanwhile, the global feature vector f̄ ∈ R^{2048} is extracted from the last average pooling layer of Inception-v3. Finally, we convert the image features to the common semantic space of the text features by adding a perceptron layer:

    v = W f,    v̄ = W̄ f̄,    (6)

where v ∈ R^{D×289} and its i-th column v_i is the visual feature vector for the i-th sub-region of the image, and v̄ ∈ R^D is the global vector for the whole image. D is the dimension of the multimodal (i.e., image and text modalities) feature space. For efficiency, all parameters in layers built from the Inception-v3 model are fixed, and the parameters in the newly added layers are jointly learned with the rest of the network.
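The projection in Eq. (6) can be sketched as below; it assumes the frozen Inception-v3 features f and f̄ are extracted beforehand, and implements W as a 1×1 convolution over the 17×17 grid, which is equivalent to applying the same perceptron to every sub-region.

```python
import torch
import torch.nn as nn

class ImageProjection(nn.Module):
    """Sketch of Eq. (6): project Inception-v3 features into the D-dimensional
    multimodal space. Local features f (768x17x17 from 'mixed_6e') and the
    global feature f_bar (2048-d from the last average pooling layer) are
    assumed to be precomputed with the frozen backbone."""
    def __init__(self, D):
        super().__init__()
        self.W_local = nn.Conv2d(768, D, kernel_size=1, bias=False)   # v = W f
        self.W_global = nn.Linear(2048, D, bias=False)                # v_bar = W_bar f_bar

    def forward(self, local_feat, global_feat):
        # local_feat: (batch, 768, 17, 17), global_feat: (batch, 2048)
        v = self.W_local(local_feat).flatten(2)    # (batch, D, 289) sub-region vectors
        v_bar = self.W_global(global_feat)         # (batch, D) global image vector
        return v, v_bar
```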
The attention-driven image-text matching score is
designed to measure the matching of an image-sentence pair
based on an attention model between the image and the text.
We first calculate the similarity matrix for all possible pairs of words in the sentence and sub-regions in the image by

    s = e^T v,    (7)

where s ∈ R^{T×289} and s_{i,j} is the dot-product similarity between the i-th word of the sentence and the j-th sub-region of the image. We find that it is beneficial to normalize the similarity matrix as follows:

    s̄_{i,j} = exp(s_{i,j}) / Σ_{k=0}^{T−1} exp(s_{k,j}).    (8)
Then, we build an attention model to compute a region-
context vector for each word (query). The region-context
vector ci is a dynamic representation of the image’s sub-
regions related to the ith
word of the sentence. It is com-
puted as the weighted sum over all regional visual vectors,
i.e.,
    c_i = Σ_{j=0}^{288} α_j v_j,  where  α_j = exp(γ_1 s̄_{i,j}) / Σ_{k=0}^{288} exp(γ_1 s̄_{i,k}).    (9)
Here, γ1 is a factor that determines how much attention is
paid to features of its relevant sub-regions when computing
the region-context vector for a word.
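A compact sketch of Eqs. (7)-(9) is given below; tensor shapes follow the definitions above, and the code is an illustration rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def region_context_vectors(word_feats, region_feats, gamma1=5.0):
    """Sketch of Eqs. (7)-(9): word-to-region attention in the DAMSM.

    word_feats:   (batch, D, T)   projected word vectors e
    region_feats: (batch, D, 289) projected sub-region vectors v
    Returns c: (batch, D, T), the region-context vector for every word.
    """
    s = torch.bmm(word_feats.transpose(1, 2), region_feats)    # Eq. (7): (batch, T, 289)
    s_bar = F.softmax(s, dim=1)                                 # Eq. (8): normalize over words
    alpha = F.softmax(gamma1 * s_bar, dim=2)                    # Eq. (9): attention over sub-regions
    c = torch.bmm(region_feats, alpha.transpose(1, 2))          # weighted sum of v_j -> (batch, D, T)
    return c
```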
Finally, we define the relevance between the i-th word and the image using the cosine similarity between c_i and e_i, i.e., R(c_i, e_i) = (c_i^T e_i) / (||c_i|| ||e_i||). Inspired by the minimum classification error formulation in speech recognition (see, e.g., [11, 8]), the attention-driven image-text matching score between the entire image (Q) and the whole text description (D) is defined as

    R(Q, D) = log ( Σ_{i=1}^{T−1} exp(γ_2 R(c_i, e_i)) )^{1/γ_2},    (10)

where γ_2 is a factor that determines how much to magnify the importance of the most relevant word-to-region-context pair. When γ_2 → ∞, R(Q, D) approximates max_{i=1}^{T−1} R(c_i, e_i).
The DAMSM loss is designed to learn the attention
model in a semi-supervised manner, in which the only su-
pervision is the matching between entire images and whole
sentences (a sequence of words). Similar to [4, 9], for a batch of image-sentence pairs {(Q_i, D_i)}_{i=1}^M, the posterior probability of sentence D_i being matched with image Q_i is computed as

    P(D_i|Q_i) = exp(γ_3 R(Q_i, D_i)) / Σ_{j=1}^{M} exp(γ_3 R(Q_i, D_j)),    (11)

where γ_3 is a smoothing factor determined by experiments. In this batch of sentences, only D_i matches the image Q_i, and we treat all other M − 1 sentences as mismatching descriptions. Following [4, 9], we define the loss function as the negative log posterior probability that the images are matched with their corresponding text descriptions (ground truth), i.e.,

    L^w_1 = − Σ_{i=1}^{M} log P(D_i|Q_i),    (12)

where 'w' stands for "word". Symmetrically, we also minimize

    L^w_2 = − Σ_{i=1}^{M} log P(Q_i|D_i),    (13)

where P(Q_i|D_i) = exp(γ_3 R(Q_i, D_i)) / Σ_{j=1}^{M} exp(γ_3 R(Q_j, D_i)) is the posterior probability that sentence D_i is matched with its corresponding image Q_i. If we redefine Eq. (10) by R(Q, D) = (v̄^T ē) / (||v̄|| ||ē||) and substitute it into Eqs. (11), (12) and (13), we obtain loss functions L^s_1 and L^s_2 (where 's' stands for "sentence") using the sentence vector ē and the global image vector v̄.

Finally, the DAMSM loss is defined as

    L_DAMSM = L^w_1 + L^w_2 + L^s_1 + L^s_2.    (14)
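The word-level part of the DAMSM loss, Eqs. (10)-(12), can be sketched as follows for a batch of matched image-sentence pairs; it reuses the region_context_vectors sketch given after Eq. (9), and the sentence-level losses L^s_1 and L^s_2 follow the same pattern using the cosine similarity of the global vectors v̄ and ē.

```python
import torch
import torch.nn.functional as F

def damsm_word_loss(word_feats, region_feats, gamma2=5.0, gamma3=10.0):
    """Sketch of Eqs. (10)-(12).
    word_feats:   (M, D, T)   word features e for each sentence D_j in the batch
    region_feats: (M, D, 289) sub-region features v for each image Q_i in the batch
    """
    M = word_feats.size(0)
    scores = torch.zeros(M, M, device=word_feats.device)            # scores[i, j] = R(Q_i, D_j)
    for i in range(M):
        regions = region_feats[i].unsqueeze(0).expand(M, -1, -1).contiguous()
        c = region_context_vectors(word_feats, regions)              # region-context vectors, Eq. (9)
        rel = F.cosine_similarity(c, word_feats, dim=1)              # R(c_i, e_i) for every word: (M, T)
        scores[i] = torch.logsumexp(gamma2 * rel, dim=1) / gamma2    # Eq. (10)
    log_post = F.log_softmax(gamma3 * scores, dim=1)                 # Eq. (11): log P(D_j | Q_i)
    # L2^w (Eq. 13) uses the same scores with the softmax taken over images (dim=0).
    return -log_post.diag().sum()                                    # Eq. (12): L1^w
```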
Based on experiments on a held-out validation set, we set
the hyperparameters in this section as: γ1 = 5, γ2 = 5,
γ3 = 10 and M = 50. Our DAMSM is pretrained¹ by minimizing L_DAMSM using real image-text pairs. Since
the size of images for pretraining DAMSM is not limited
by the size of images that can be generated, real images of
size 299×299 are utilized. In addition, the pretrained text-
encoder in the DAMSM provides visually-discriminative
word vectors learned from image-text paired data for the
attentional generative network. In comparison, conven-
tional word vectors pretrained on pure text data are often
not visually-discriminative, e.g., word vectors of different
colors, such as red, blue, yellow, etc., are often clustered
together in the vector space, due to the lack of grounding
them to the actual visual signals.
In sum, we propose two novel attention models, the at-
tentional generative network and the DAMSM, which play
different roles in the AttnGAN. (i) The attention mechanism
in the generative network (see Eq. 2) enables the AttnGAN
to automatically select word level condition for generating
different sub-regions of the image. (ii) With an attention
mechanism (see Eq. 9), the DAMSM is able to compute
the fine-grained text-image matching loss LDAMSM . It is
worth mentioning that, LDAMSM is applied only on the
output of the last generator Gm−1, because the eventual
goal of the AttnGAN is to generate large images by the last
¹We also fine-tuned the DAMSM together with the whole network; however, the performance was not improved.
Dataset           CUB [28]            COCO [14]
                  train     test      train     test
#samples          8,855     2,933     80k       40k
captions/image    10        10        5         5
Table 1. Statistics of the datasets.
generator. We tried to apply L_DAMSM to images of all resolutions generated by (G_0, G_1, ..., G_{m−1}); however, the performance was not improved while the computational cost increased.
4. Experiments
Extensive experimentation is carried out to evaluate the
proposed AttnGAN. We first study the important compo-
nents of the AttnGAN, including the attentional genera-
tive network and the DAMSM. Then, we compare our At-
tnGAN with previous state-of-the-art GAN models for text-
to-image synthesis [31, 32, 20, 18, 16].
Datasets. Same as previous text-to-image meth-
ods [31, 32, 20, 18], our method is evaluated on CUB [28]
and COCO [14] datasets. We preprocess the CUB dataset
according to the method in [31]. Table 1 lists the statistics
of datasets.
Evaluation. Following Zhang et al. [31], we use the
inception score [23] as the quantitative evaluation measure.
Since the inception score cannot reflect whether the gener-
ated image is well conditioned on the given text description,
we propose to use R-precision, a common evaluation met-
ric for ranking retrieval results, as a complementary eval-
uation metric for the text-to-image synthesis task. If there
are R relevant documents for a query, we examine the top
R ranked retrieval results of a system, and find that r are
relevant, and then by definition, the R-precision is r/R.
More specifically, we conduct a retrieval experiment, i.e.,
we use generated images to query their corresponding text
descriptions. First, the image and text encoders learned in
our pretrained DAMSM are utilized to extract global feature
vectors of the generated images and the given text descrip-
tions. And then, we compute cosine similarities between the
global image vectors and the global text vectors. Finally,
we rank candidate text descriptions for each image in de-
scending similarity and find the top r relevant descriptions
for computing the R-precision. To compute the inception
score and the R-precision, each model generates 30,000 im-
ages from randomly selected unseen text descriptions. The
candidate text descriptions for each query image consist of
one ground truth (i.e., R = 1) and 99 randomly selected
mismatching descriptions.
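A sketch of this R-precision protocol (with R = 1) is given below; image_vecs and text_vecs are assumed to be the DAMSM global vectors of generated images and their ground-truth captions, and for illustration the mismatching captions are drawn from the other ground-truth captions in the pool.

```python
import torch
import torch.nn.functional as F

def r_precision(image_vecs, text_vecs, n_mismatch=99):
    """Sketch of the R-precision protocol with R = 1.
    image_vecs: (N, D) global vectors of generated images (DAMSM image encoder)
    text_vecs:  (N, D) global vectors of their ground-truth captions (DAMSM text encoder)
    """
    n = image_vecs.size(0)
    hits = 0
    for i in range(n):
        # 99 randomly selected mismatching descriptions plus the ground truth
        wrong = [j for j in torch.randperm(n).tolist() if j != i][:n_mismatch]
        candidates = torch.cat([text_vecs[i:i + 1], text_vecs[wrong]], dim=0)
        sims = F.cosine_similarity(image_vecs[i:i + 1], candidates, dim=1)
        hits += int(sims.argmax().item() == 0)     # is the ground-truth caption ranked first?
    return hits / n
```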
Besides quantitative evaluation, we also qualitatively
examine the samples generated by our models. Specifi-
cally, we visualize the intermediate results with attention
learned by the attention models F^attn. As defined in Eq. (2), the weight β_{j,i} indicates which words the model attends to when generating a sub-region of the image, and Σ_{i=0}^{T−1} β_{j,i} = 1. We suppress the less-relevant words for an image's sub-region via

    β̂_{j,i} = β_{j,i} if β_{j,i} > 1/T, and β̂_{j,i} = 0 otherwise.    (15)

[Figure 3 plots the inception score and R-precision (%) against the training epoch for AttnGAN1 (no DAMSM; λ = 0.1, 1, 5, 10) and AttnGAN2 (λ = 5) on CUB, and for AttnGAN1 (λ = 0.1, 1, 10, 50, 100) and AttnGAN2 (λ = 50) on COCO.]
Figure 3. Inception scores and R-precision rates by our AttnGAN and its variants at different epochs on CUB (top) and COCO (bottom) test sets. For the text-to-image synthesis task, R = 1.

Method                    inception score    R-precision (%)
AttnGAN1, no DAMSM        3.98 ± .04         10.37 ± 5.88
AttnGAN1, λ = 0.1         4.19 ± .06         16.55 ± 4.83
AttnGAN1, λ = 1           4.35 ± .05         34.96 ± 4.02
AttnGAN1, λ = 5           4.35 ± .04         58.65 ± 5.41
AttnGAN1, λ = 10          4.29 ± .05         63.87 ± 4.85
AttnGAN2, λ = 5           4.36 ± .03         67.82 ± 4.43
AttnGAN2, λ = 50 (COCO)   25.89 ± .47        85.47 ± 3.69
Table 2. The best inception score and the corresponding R-precision rate of each AttnGAN model on CUB (top six rows) and COCO (the last row) test sets. More results are shown in Figure 3.

For better visualization, we fix a word and compute its attention weights over the N different sub-regions of an image, β̂_{0,i}, β̂_{1,i}, ..., β̂_{N−1,i}. We reshape the N attention weights to √N × √N pixels, which are then upsampled with Gaussian filters to the same size as the generated images. Limited by the length of the paper, we only visualize the top-5 most attended words (i.e., the words with the top-5 highest Σ_{j=0}^{N−1} β̂_{j,i} values) for each attention model.
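The thresholding and upsampling steps can be sketched as follows; beta is assumed to be an N×T matrix of attention weights for one image, and bilinear interpolation is used here as a stand-in for the Gaussian-filter upsampling described above.

```python
import torch
import torch.nn.functional as F

def attention_map(beta, word_idx, T, image_size):
    """Sketch of Eq. (15) and the visualization step: threshold the attention
    weights of one word, reshape them to a sqrt(N) x sqrt(N) grid, and upsample
    to the size of the generated image."""
    weights = beta[:, word_idx].clone()          # beta_{j, i} for a fixed word i over all N sub-regions
    weights[weights <= 1.0 / T] = 0.0            # Eq. (15): suppress less-relevant words
    side = int(weights.numel() ** 0.5)           # N = side * side sub-regions
    grid = weights.view(1, 1, side, side)
    up = F.interpolate(grid, size=(image_size, image_size),
                       mode='bilinear', align_corners=False)
    return up.squeeze()                          # (image_size, image_size) attention map
```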
4.1. Component analysis
In this section, we first quantitatively evaluate the At-
tnGAN and its variants. The results are shown in Table 2
and Figure 3. Our “AttnGAN1” architecture has one atten-
tion model and two generators, while the “AttnGAN2” ar-
chitecture has two attention models stacked with three gen-
erators (see Figure 2). In addition, as illustrated in Figure 4,
Figure 5, Figure 6, and Figure 7, we qualitatively examine
the images generated by our AttnGAN.
The DAMSM loss. To test the proposed LDAMSM ,
we adjust the value of λ (see Eq. (3)). As shown in Fig-
ure 3, a larger λ leads to a significantly higher R-precision
rate on both CUB and COCO datasets. On the CUB dataset,
when the value of λ is increased from 0.1 to 5, the incep-
tion score of the AttnGAN1 is improved from 4.19 to 4.35
and the corresponding R-precision rate is increased from
16.55% to 58.65% (see Table 2). On the COCO dataset,
by increasing the value of λ from 0.1 to 50, the AttnGAN1
achieves both high inception score and R-precision rate (see
Figure 3). This comparison demonstrates that properly in-
creasing the weight of LDAMSM helps to generate higher
quality images that are better conditioned on given text de-
scriptions. The reason is that the proposed fine-grained
image-text matching loss LDAMSM provides additional su-
pervision (i.e., word level matching information) for train-
ing the generator. Moreover, in our experiments, we do not
observe any collapsed nonsensical mode in the visualization
of AttnGAN-generated images. It indicates that, with extra
supervision, the fine-grained image-text matching loss also
helps to stabilize the training process of the AttnGAN. In
addition, if we replace the proposed DAMSM sub-network
with the text encoder used in [19], the inception score and R-precision on the CUB dataset drop to 3.98 and 10.37%, respectively (i.e., the "AttnGAN1, no DAMSM" entry in Table 2), which further demonstrates the effectiveness of the
proposed LDAMSM .
The attentional generative network. As shown in Ta-
ble 2 and Figure 3, stacking two attention models in the
generative networks not only generates images of a higher
resolution (from 128×128 to 256×256 resolution), but also
yields higher inception scores on both CUB and COCO
datasets. To guarantee image quality, we find the best value of λ for each dataset by increasing λ until the overall inception score starts to drop on a held-out validation set. "AttnGAN1" models are built for searching for the best λ, based on which an "AttnGAN2" model is built to generate higher resolution images. Due to GPU memory constraints, we did not try the AttnGAN with three attention models. As a result, our final model for CUB
and COCO is “AttnGAN2, λ=5” and “AttnGAN2, λ=50”,
respectively. The final λ of the COCO dataset turns out to
be much larger than that of the CUB dataset, indicating that
the proposed LDAMSM is especially important for generat-
[Figure 4 input texts: "the bird has a yellow crown and a black eyering that is round", "this bird has a green crown black primaries and a white belly", "a photo of a homemade swirly pasta with broccoli carrots and onions", and "a fruit stand display with bananas and kiwi", each shown with the top-5 attended words at the two attention stages.]
Figure 4. Intermediate results of our AttnGAN on CUB (top) and COCO (bottom) test sets. In each block, the first row gives 64×64 images by G0, 128×128 images by G1 and 256×256 images by G2 of the AttnGAN; the second and third rows show the top-5 most attended words by F^attn_1 and F^attn_2 of the AttnGAN, respectively. Refer to the supplementary material for more examples.
Dataset   GAN-INT-CLS [20]   GAWWN [18]   StackGAN [31]   StackGAN-v2 [32]   PPGN [16]    Our AttnGAN
CUB       2.88 ± .04         3.62 ± .07   3.70 ± .04      3.82 ± .06         /            4.36 ± .03
COCO      7.88 ± .07         /            8.45 ± .03      /                  9.58 ± .21   25.89 ± .47
Table 3. Inception scores by state-of-the-art GAN models [20, 18, 31, 32, 16] and our AttnGAN on CUB and COCO test sets.
ing complex scenarios like those in the COCO dataset.
To better understand what has been learned by the At-
tnGAN, we visualize its intermediate results with attention.
As shown in Figure 4, the first stage of the AttnGAN (G0)
just sketches the primitive shape and colors of objects and
generates low resolution images. Since only the global sen-
tence vectors are utilized in this stage, the generated images
lack details described by exact words, e.g., the beak and
eyes of a bird. Based on word vectors, the following stages
(G1 and G2) learn to rectify defects in results of the previ-
ous stage and add more details to generate higher-resolution
images. Some sub-regions/pixels of G1 or G2 images can
be inferred directly from images generated by the previous
stage.

[Figure 5 input texts: "this bird has wings that are black and has a white belly", "this bird has wings that are red and has a yellow belly", "this bird has wings that are blue and has a red belly".]
Figure 5. Example results of our AttnGAN model trained on CUB while changing some most attended words in the text descriptions.

[Figure 6 input texts: "a fluffy black cat floating on top of a lake", "a red double decker bus is floating on top of a lake", "a stop sign is floating on top of a lake", "a stop sign is flying in the blue sky".]
Figure 6. 256×256 images generated from descriptions of novel scenarios using the AttnGAN model trained on COCO. (Intermediate results are given in the supplementary material.)

Figure 7. Novel images by our AttnGAN on the CUB test set.

For those sub-regions, the attention is equally allocated to all words and shown to be black in the attention
map (see Figure 4). For other sub-regions, which usually
have semantic meaning expressed in the text description
such as the attributes of objects, the attention is allocated to
their most relevant words (bright regions in Figure 4). Thus,
those regions are inferred from both word-context features
and previous image features of those regions. As shown in
Figure 4, on the CUB dataset, the words the, this, bird are
usually attended by the F^attn models for locating the ob-
ject; the words describing object attributes, such as colors
and parts of birds, are also attended for correcting defects
and drawing details. On the COCO dataset, we have similar
observations. Since there are usually more than one ob-
ject in each COCO image, it is more visible that the words
describing different objects are attended by different sub-
regions of the image, e.g., bananas, kiwi in the bottom-right
block of Figure 4. Those observations demonstrate that the
AttnGAN learns to understand the detailed semantic mean-
ing expressed in the text description of an image. Another
observation is that our second attention model F^attn_2 is able to attend to some new words that were omitted by the first attention model F^attn_1 (see Figure 4). It demonstrates that,
to provide richer information for generating higher resolu-
tion images at latter stages of the AttnGAN, the correspond-
ing attention models learn to recover objects and attributes
omitted at previous stages.
Generalization ability. Our experimental results above
have quantitatively and qualitatively shown the generaliza-
tion ability of the AttnGAN by generating images from
unseen text descriptions. Here we further test how sensi-
tive the outputs are to changes in the input sentences by
changing some most attended words in the text descriptions.
Some examples are shown in Figure 5. It illustrates that the
generated images are modified according to the changes in
the input sentences, showing that the model can catch sub-
tle semantic differences in the text description. Moreover,
as shown in Figure 6, our AttnGAN can generate images to
reflect the semantic meaning of descriptions of novel sce-
narios that are not likely to happen in the real world, e.g.,
a stop sign is floating on top of a lake. On the other hand,
we also observe that the AttnGAN sometimes generates images which are sharp and detailed but unlikely to be realistic. As the examples in Figure 7 show, the AttnGAN creates birds with multiple heads, eyes or tails, which only exist in fairy tales. This indicates that our current method is still not perfect at capturing globally coherent structure, which leaves room for improvement. To sum up, the observations shown
in Figure 5, Figure 6 and Figure 7 further demonstrate the
generalization ability of the AttnGAN.
4.2. Comparison with previous methods
We compare our AttnGAN with previous state-of-the-
art GAN models for text-to-image generation on CUB and
COCO test sets. As shown in Table 3, on the CUB dataset,
our AttnGAN achieves 4.36 inception score, which signif-
icantly outperforms the previous best inception score of
3.82. More impressively, our AttnGAN boosts the best reported inception score on the COCO dataset from 9.58 to 25.89, a relative improvement of 170.25%. The COCO
dataset is known to be much more challenging than the
CUB dataset because it consists of images with more com-
plex scenarios. Existing methods struggle in generating re-
alistic high-resolution images on this dataset. Examples
in Figure 4 and Figure 6 illustrate that our AttnGAN suc-
ceeds in generating 256×256 images for various scenarios
on the COCO dataset, although those generated images of
the COCO dataset are not as photo-realistic as that of the
CUB dataset. The experimental results show that, com-
pared to previous state-of-the-art approaches, the AttnGAN
is more effective for generating complex scenes due to its
novel attention mechanism that catches fine-grained word
level and sub-region level information in text-to-image gen-
eration.
5. Conclusions
In this paper, we propose an Attentional Generative Ad-
versarial Network, named AttnGAN, for fine-grained text-
to-image synthesis. First, we build a novel attentional gen-
erative network for the AttnGAN to generate high quality images through a multi-stage process. Second, we pro-
pose a deep attentional multimodal similarity model to com-
pute the fine-grained image-text matching loss for train-
ing the generator of the AttnGAN. Our AttnGAN signif-
icantly outperforms previous state-of-the-art GAN models,
boosting the best reported inception score by 14.14% on the
CUB dataset and 170.25% on the more challenging COCO
dataset. Extensive experimental results clearly demonstrate
the effectiveness of the proposed attention mechanism in
the AttnGAN, which is especially critical for text-to-image
generation for complex scenes.
References
[1] A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Parikh,
and D. Batra. VQA: visual question answering. IJCV, 123(1):4–31,
2017. 1
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by
jointly learning to align and translate. arXiv:1409.0473, 2014. 2
[3] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative
image models using a laplacian pyramid of adversarial networks. In
NIPS, 2015. 2
[4] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollár,
J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig.
From captions to visual concepts and back. In CVPR, 2015. 1, 4, 5
[5] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng.
Semantic compositional networks for visual captioning. In CVPR,
2017. 1
[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-
Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative ad-
versarial nets. In NIPS, 2014. 1, 2
[7] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra.
DRAW: A recurrent neural network for image generation. In ICML,
2015. 2
[8] X. He, L. Deng, and W. Chou. Discriminative learning in sequential
pattern recognition. IEEE Signal Processing Magazine, 25(5):14–36,
2008. 4
[9] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learn-
ing deep structured semantic models for web search using click-
through data. In CIKM, 2013. 4, 5
[10] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image trans-
lation with conditional adversarial networks. In CVPR, 2017. 2
[11] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error
rate methods for speech recognition. IEEE Transactions on Speech
and Audio Processing, 5(3):257–265, 1997. 4
[12] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In
ICLR, 2014. 2
[13] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani,
J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-
resolution using a generative adversarial network. In CVPR, 2017.
2
[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in
context. In ECCV, 2014. 1, 5
[15] E. Mansimov, E. Parisotto, L. J. Ba, and R. Salakhutdinov. Generat-
ing images from captions with attention. In ICLR, 2016. 2
[16] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune.
Plug & play generative networks: Conditional iterative generation of
images in latent space. In CVPR, 2017. 2, 5, 7
[17] A. Radford, L. Metz, and S. Chintala. Unsupervised representation
learning with deep convolutional generative adversarial networks. In
ICLR, 2016. 2
[18] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee.
Learning what and where to draw. In NIPS, 2016. 1, 2, 5, 7
[19] S. Reed, Z. Akata, B. Schiele, and H. Lee. Learning deep represen-
tations of fine-grained visual descriptions. In CVPR, 2016. 1, 6
[20] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee.
Generative adversarial text-to-image synthesis. In ICML, 2016. 1, 2,
5, 7
[21] S. E. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo,
Z. Wang, Y. Chen, D. Belov, and N. de Freitas. Parallel multiscale
autoregressive density estimation. In ICML, 2017. 2
[22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and
L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge.
IJCV, 115(3):211–252, 2015. 4
[23] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford,
and X. Chen. Improved techniques for training gans. In NIPS, 2016.
2, 5
[24] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural net-
works. IEEE Trans. Signal Processing, 45(11):2673–2681, 1997. 4
[25] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Re-
thinking the inception architecture for computer vision. In CVPR,
2016. 4
[26] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt,
A. Graves, and K. Kavukcuoglu. Conditional image generation with
pixelcnn decoders. In NIPS, 2016. 2
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need.
arXiv:1706.03762, 2017. 2
[28] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The
Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-
2011-001, California Institute of Technology, 2011. 5
[29] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov,
R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image
caption generation with visual attention. In ICML, 2015. 1, 2
[30] Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola. Stacked attention
networks for image question answering. In CVPR, 2016. 1, 2
[31] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and
D. Metaxas. Stackgan: Text to photo-realistic image synthesis with
stacked generative adversarial networks. In ICCV, 2017. 1, 2, 3, 5, 7
[32] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N.
Metaxas. Stackgan++: Realistic image synthesis with stacked gen-
erative adversarial networks. arXiv: 1710.10916, 2017. 1, 2, 3, 5,
7
Supplementary Material
Due to the size limit, more examples are available in the appendix, which can be downloaded from this link.
 
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
Willy Marroquin (WillyDevNET)
 
An Artificial Neuron Implemented on an Actual Quantum Processor
An Artificial Neuron Implemented on an Actual Quantum ProcessorAn Artificial Neuron Implemented on an Actual Quantum Processor
An Artificial Neuron Implemented on an Actual Quantum Processor
Willy Marroquin (WillyDevNET)
 
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROSENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
Willy Marroquin (WillyDevNET)
 
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and...
The Malicious Use   of Artificial Intelligence: Forecasting, Prevention,  and...The Malicious Use   of Artificial Intelligence: Forecasting, Prevention,  and...
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and...
Willy Marroquin (WillyDevNET)
 
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
Willy Marroquin (WillyDevNET)
 
Deep learning-approach
Deep learning-approachDeep learning-approach
Deep learning-approach
Willy Marroquin (WillyDevNET)
 
WEF new vision for education
WEF new vision for educationWEF new vision for education
WEF new vision for education
Willy Marroquin (WillyDevNET)
 
El futuro del trabajo perspectivas regionales
El futuro del trabajo perspectivas regionalesEl futuro del trabajo perspectivas regionales
El futuro del trabajo perspectivas regionales
Willy Marroquin (WillyDevNET)
 
ASIA Y EL NUEVO (DES)ORDEN MUNDIAL
ASIA Y EL NUEVO (DES)ORDEN MUNDIALASIA Y EL NUEVO (DES)ORDEN MUNDIAL
ASIA Y EL NUEVO (DES)ORDEN MUNDIAL
Willy Marroquin (WillyDevNET)
 
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood DetectionDeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
Willy Marroquin (WillyDevNET)
 
FOR A MEANINGFUL ARTIFICIAL INTELLIGENCE TOWARDS A FRENCH AND EUROPEAN ST...
FOR A  MEANINGFUL  ARTIFICIAL  INTELLIGENCE TOWARDS A FRENCH  AND EUROPEAN ST...FOR A  MEANINGFUL  ARTIFICIAL  INTELLIGENCE TOWARDS A FRENCH  AND EUROPEAN ST...
FOR A MEANINGFUL ARTIFICIAL INTELLIGENCE TOWARDS A FRENCH AND EUROPEAN ST...
Willy Marroquin (WillyDevNET)
 
When Will AI Exceed Human Performance? Evidence from AI Experts
When Will AI Exceed Human Performance? Evidence from AI ExpertsWhen Will AI Exceed Human Performance? Evidence from AI Experts
When Will AI Exceed Human Performance? Evidence from AI Experts
Willy Marroquin (WillyDevNET)
 
Microsoft AI Platform Whitepaper
Microsoft AI Platform WhitepaperMicrosoft AI Platform Whitepaper
Microsoft AI Platform Whitepaper
Willy Marroquin (WillyDevNET)
 
Governance in the Age of Generative AI: A 360º Approach for Resilient Pol...
Governance in the   Age of Generative AI:  A 360º Approach for Resilient  Pol...Governance in the   Age of Generative AI:  A 360º Approach for Resilient  Pol...
Governance in the Age of Generative AI: A 360º Approach for Resilient Pol...
Willy Marroquin (WillyDevNET)
 
Language Is Not All You Need: Aligning Perception with Language Models
Language Is Not All You Need: Aligning Perception with Language ModelsLanguage Is Not All You Need: Aligning Perception with Language Models
Language Is Not All You Need: Aligning Perception with Language Models
Willy Marroquin (WillyDevNET)
 
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
Inteligencia artificial y crecimiento económico. Oportunidades y desafíos par...
Willy Marroquin (WillyDevNET)
 
An Artificial Neuron Implemented on an Actual Quantum Processor
An Artificial Neuron Implemented on an Actual Quantum ProcessorAn Artificial Neuron Implemented on an Actual Quantum Processor
An Artificial Neuron Implemented on an Actual Quantum Processor
Willy Marroquin (WillyDevNET)
 
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROSENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
ENFERMEDAD DE ALZHEIMER PRESENTE TERAP...UTICO Y RETOS FUTUROS
Willy Marroquin (WillyDevNET)
 
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and...
The Malicious Use   of Artificial Intelligence: Forecasting, Prevention,  and...The Malicious Use   of Artificial Intelligence: Forecasting, Prevention,  and...
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and...
Willy Marroquin (WillyDevNET)
 
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
TowardsDeepLearningModelsforPsychological StatePredictionusingSmartphoneData:...
Willy Marroquin (WillyDevNET)
 
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood DetectionDeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
Willy Marroquin (WillyDevNET)
 
FOR A MEANINGFUL ARTIFICIAL INTELLIGENCE TOWARDS A FRENCH AND EUROPEAN ST...
FOR A  MEANINGFUL  ARTIFICIAL  INTELLIGENCE TOWARDS A FRENCH  AND EUROPEAN ST...FOR A  MEANINGFUL  ARTIFICIAL  INTELLIGENCE TOWARDS A FRENCH  AND EUROPEAN ST...
FOR A MEANINGFUL ARTIFICIAL INTELLIGENCE TOWARDS A FRENCH AND EUROPEAN ST...
Willy Marroquin (WillyDevNET)
 
When Will AI Exceed Human Performance? Evidence from AI Experts
When Will AI Exceed Human Performance? Evidence from AI ExpertsWhen Will AI Exceed Human Performance? Evidence from AI Experts
When Will AI Exceed Human Performance? Evidence from AI Experts
Willy Marroquin (WillyDevNET)
 
Ad

Recently uploaded (20)

Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptxConcrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
ssuserd1f4a3
 
Lesson-2.pptxjsjahajauahahagqiqhwjwjahaiq
Lesson-2.pptxjsjahajauahahagqiqhwjwjahaiqLesson-2.pptxjsjahajauahahagqiqhwjwjahaiq
Lesson-2.pptxjsjahajauahahagqiqhwjwjahaiq
AngelPinedaTaguinod
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
presentacion.slideshare.informáticaJuridica..pptx
presentacion.slideshare.informáticaJuridica..pptxpresentacion.slideshare.informáticaJuridica..pptx
presentacion.slideshare.informáticaJuridica..pptx
GersonVillatoro4
 
Snowflake training | Snowflake online course
Snowflake training | Snowflake online courseSnowflake training | Snowflake online course
Snowflake training | Snowflake online course
Accentfuture
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual FormStorage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Professional Content Writing's
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
Introduction to Python_for_machine_learning.pdf
Introduction to Python_for_machine_learning.pdfIntroduction to Python_for_machine_learning.pdf
Introduction to Python_for_machine_learning.pdf
goldenflower34
 
2022.02.07_Bahan DJE Energy Transition Dialogue 2022 kirim.pdf
2022.02.07_Bahan DJE Energy Transition Dialogue 2022 kirim.pdf2022.02.07_Bahan DJE Energy Transition Dialogue 2022 kirim.pdf
2022.02.07_Bahan DJE Energy Transition Dialogue 2022 kirim.pdf
RomiRomeo
 
Time series analysis & forecasting-Day1.pptx
Time series analysis & forecasting-Day1.pptxTime series analysis & forecasting-Day1.pptx
Time series analysis & forecasting-Day1.pptx
AsmaaMahmoud89
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
From Data to Insight: How News Aggregator APIs Deliver Contextual Intelligence
From Data to Insight: How News Aggregator APIs Deliver Contextual IntelligenceFrom Data to Insight: How News Aggregator APIs Deliver Contextual Intelligence
From Data to Insight: How News Aggregator APIs Deliver Contextual Intelligence
Contify
 
Taking a customer journey with process mining
Taking a customer journey with process miningTaking a customer journey with process mining
Taking a customer journey with process mining
Process mining Evangelist
 
End to End Process Analysis - Cox Communications
End to End Process Analysis - Cox CommunicationsEnd to End Process Analysis - Cox Communications
End to End Process Analysis - Cox Communications
Process mining Evangelist
 
2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf
dominikamizerska1
 
Red Hat Openshift Training - openshift (1).pptx
Red Hat Openshift Training - openshift (1).pptxRed Hat Openshift Training - openshift (1).pptx
Red Hat Openshift Training - openshift (1).pptx
ssuserf60686
 
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial IntelligenceDr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug
 
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptxConcrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
Concrete_Presenbmlkvvbvvvfvbbbfcfftation.pptx
ssuserd1f4a3
 
Lesson-2.pptxjsjahajauahahagqiqhwjwjahaiq
Lesson-2.pptxjsjahajauahahagqiqhwjwjahaiqLesson-2.pptxjsjahajauahahagqiqhwjwjahaiq
Lesson-2.pptxjsjahajauahahagqiqhwjwjahaiq
AngelPinedaTaguinod
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
presentacion.slideshare.informáticaJuridica..pptx
presentacion.slideshare.informáticaJuridica..pptxpresentacion.slideshare.informáticaJuridica..pptx
presentacion.slideshare.informáticaJuridica..pptx
GersonVillatoro4
 
Snowflake training | Snowflake online course
Snowflake training | Snowflake online courseSnowflake training | Snowflake online course
Snowflake training | Snowflake online course
Accentfuture
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual FormStorage Devices and the Mechanism of Data Storage in Audio and Visual Form
Storage Devices and the Mechanism of Data Storage in Audio and Visual Form
Professional Content Writing's
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
Introduction to Python_for_machine_learning.pdf
Introduction to Python_for_machine_learning.pdfIntroduction to Python_for_machine_learning.pdf
Introduction to Python_for_machine_learning.pdf
goldenflower34
 
2022.02.07_Bahan DJE Energy Transition Dialogue 2022 kirim.pdf
2022.02.07_Bahan DJE Energy Transition Dialogue 2022 kirim.pdf2022.02.07_Bahan DJE Energy Transition Dialogue 2022 kirim.pdf
2022.02.07_Bahan DJE Energy Transition Dialogue 2022 kirim.pdf
RomiRomeo
 
Time series analysis & forecasting-Day1.pptx
Time series analysis & forecasting-Day1.pptxTime series analysis & forecasting-Day1.pptx
Time series analysis & forecasting-Day1.pptx
AsmaaMahmoud89
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
From Data to Insight: How News Aggregator APIs Deliver Contextual Intelligence
From Data to Insight: How News Aggregator APIs Deliver Contextual IntelligenceFrom Data to Insight: How News Aggregator APIs Deliver Contextual Intelligence
From Data to Insight: How News Aggregator APIs Deliver Contextual Intelligence
Contify
 
Taking a customer journey with process mining
Taking a customer journey with process miningTaking a customer journey with process mining
Taking a customer journey with process mining
Process mining Evangelist
 
End to End Process Analysis - Cox Communications
End to End Process Analysis - Cox CommunicationsEnd to End Process Analysis - Cox Communications
End to End Process Analysis - Cox Communications
Process mining Evangelist
 
2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf
dominikamizerska1
 
Red Hat Openshift Training - openshift (1).pptx
Red Hat Openshift Training - openshift (1).pptxRed Hat Openshift Training - openshift (1).pptx
Red Hat Openshift Training - openshift (1).pptx
ssuserf60686
 
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial IntelligenceDr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug - Expert In Artificial Intelligence
Dr. Robert Krug
 

sub-region being drawn (see Figure 1). More specifically, besides encoding the natural language description into a global sentence vector, each word in the sentence is also encoded into a word vector. The generative network utilizes the global sentence vector to generate a low-resolution image in the first stage. In the following stages, it uses the image vector in each sub-region to query word vectors by using an attention layer to form a word-context vector. It then combines the regional image vector and the corresponding word-context vector to form a multimodal context vector, based on which the model generates new image features in the surrounding sub-regions. This effectively yields a higher resolution picture with more details at each stage. The other component in the AttnGAN is a Deep Attentional Multimodal Similarity Model (DAMSM). With an attention mechanism, the DAMSM is able to compute the similarity between the generated image and the sentence using both the global sentence-level information and the fine-grained word-level information. Thus, the DAMSM provides an additional fine-grained image-text matching loss for training the generator.

The contribution of our method is threefold. (i) An Attentional Generative Adversarial Network is proposed for synthesizing images from text descriptions. Specifically, two novel components are proposed in the AttnGAN, including the attentional generative network and the DAMSM. (ii) A comprehensive study is carried out to empirically evaluate the proposed AttnGAN. Experimental results show that the AttnGAN significantly outperforms previous state-of-the-art GAN models. (iii) A detailed analysis is performed through visualizing the attention layers of the AttnGAN. For the first time, it is demonstrated that the layered conditional GAN is able to automatically attend to relevant words to form the condition for image generation.

2. Related Work

Generating high resolution images from text descriptions, though very challenging, is important for many practical applications such as art generation and computer-aided design. Recently, great progress has been achieved in this direction with the emergence of deep generative models [12, 26, 6]. Mansimov et al. [15] built the alignDRAW model, extending the Deep Recurrent Attention Writer (DRAW) [7] to iteratively draw image patches while attending to the relevant words in the caption. Nguyen et al. [16] proposed an approximate Langevin approach to generate images from captions. Reed et al. [21] used conditional PixelCNN [26] to synthesize images from text with a multi-scale model structure. Compared with other deep generative models, Generative Adversarial Networks (GANs) [6] have shown great performance for generating sharper samples [17, 3, 23, 13, 10]. Reed et al. [20] first showed that the conditional GAN was capable of synthesizing plausible images from text descriptions. Their follow-up work [18] also demonstrated that GAN was able to generate better samples by incorporating additional conditions (e.g., object locations). Zhang et al. [31, 32] stacked several GANs for text-to-image synthesis and used different GANs to generate images of different sizes. However, all of their GANs are conditioned on the global sentence vector, missing fine-grained word-level information for image generation.

The attention mechanism has recently become an integral part of sequence transduction models. It has been successfully used in modeling multi-level dependencies in image captioning [29], image question answering [30] and machine translation [2]. Vaswani et al. [27] also demonstrated that machine translation models could achieve state-of-the-art results by solely using an attention model. In spite of this progress, the attention mechanism has not been explored in GANs for text-to-image synthesis yet. It is worth mentioning that alignDRAW [15] also used LAPGAN [3] to scale the image to a higher resolution. However, the GAN in their framework was only utilized as a post-processing step without attention. To our knowledge, the proposed AttnGAN for the first time develops an attention mechanism that enables GANs to generate fine-grained high quality images via multi-level (e.g., word level and sentence level) conditioning.

3. Attentional Generative Adversarial Network

As shown in Figure 2, the proposed Attentional Generative Adversarial Network (AttnGAN) has two novel components: the attentional generative network and the deep attentional multimodal similarity model. We elaborate on each of them in the rest of this section.

3.1. Attentional Generative Network

Current GAN-based models for text-to-image generation [20, 18, 31, 32] typically encode the whole-sentence text description into a single vector as the condition for image generation, but lack fine-grained word-level information. In this section, we propose a novel attention model that enables the generative network to draw different sub-regions of the image conditioned on words that are most relevant to those sub-regions.

As shown in Figure 2, the proposed attentional generative network has m generators (G_0, G_1, ..., G_{m-1}), which take the hidden states (h_0, h_1, ..., h_{m-1}) as input and generate images of small-to-large scales (\hat{x}_0, \hat{x}_1, ..., \hat{x}_{m-1}).
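The stage-wise data flow just described (and formalized as Eq. (1) on the next page) can be summarized in a short sketch. This is a minimal, illustrative skeleton only: the sub-networks, channel sizes, number of stages and the flattened spatial layout are placeholder assumptions rather than the authors' implementation, and the attention model is stubbed out (a concrete sketch of it follows Eq. (2) below).

```python
# Skeleton of the multi-stage attentional generator: wiring only, not the paper's code.
import torch
import torch.nn as nn

B, D, T, NZ = 4, 256, 18, 100      # batch, word-feature dim, #words, noise dim (assumed)
CH, N0 = 32, 8 * 8                 # hidden channels and #sub-regions per stage (assumed, kept tiny)

f_ca = nn.Linear(D, D)             # stand-in for Conditioning Augmentation F^ca
f0   = nn.Linear(NZ + D, CH * N0)  # stand-in for F_0 (noise + sentence code -> h_0)
f1   = nn.Conv1d(2 * CH, CH, 1)    # stand-in for F_1 (joins h with the word-context code)
g    = [nn.Conv1d(CH, 3, 1), nn.Conv1d(CH, 3, 1)]   # stand-ins for G_0, G_1

def f_attn(word_embs, h):
    """Placeholder for the stage-wise attention model F^attn (see the Eq. (2) sketch)."""
    return torch.zeros_like(h)

z         = torch.randn(B, NZ)     # noise vector
sent_emb  = torch.randn(B, D)      # global sentence vector
word_embs = torch.randn(B, D, T)   # word-vector matrix

h = f0(torch.cat([z, f_ca(sent_emb)], dim=1)).view(B, CH, N0)   # h_0
images = [g[0](h)]                                              # low-resolution output of G_0
for i in range(1, len(g)):                                      # later stages
    c = f_attn(word_embs, h)                                    # word-context features
    h = f1(torch.cat([h, c], dim=1))                            # h_i (spatial upsampling omitted)
    images.append(g[i](h))                                      # output of G_i
print([tuple(x.shape) for x in images])
```

The point is only the wiring: the first stage consumes noise plus the augmented sentence code, while every later stage consumes the previous hidden features together with a word-context code produced by attention; in the actual model each stage also doubles the spatial resolution, which this sketch omits.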
[Figure 2. The architecture of the proposed AttnGAN. Each attention model automatically retrieves the conditions (i.e., the most relevant word vectors) for generating different sub-regions of the image; the DAMSM provides the fine-grained image-text matching loss for the generative network.]

Specifically,

h_0 = F_0(z, F^{ca}(\bar{e}));
h_i = F_i(h_{i-1}, F^{attn}_i(e, h_{i-1})), \quad i = 1, 2, ..., m-1;
\hat{x}_i = G_i(h_i).    (1)

Here, z is a noise vector usually sampled from a standard normal distribution, \bar{e} is the global sentence vector, and e is the matrix of word vectors. F^{ca} represents the Conditioning Augmentation [31] that converts the sentence vector \bar{e} to the conditioning vector. F^{attn}_i is the proposed attention model at the i-th stage of the AttnGAN. F^{ca}, F^{attn}_i, F_i, and G_i are modeled as neural networks.

The attention model F^{attn}(e, h) has two inputs: the word features e \in R^{D \times T} and the image features from the previous hidden layer h \in R^{\hat{D} \times N}. The word features are first converted into the common semantic space of the image features by adding a new perceptron layer, i.e., e' = Ue, where U \in R^{\hat{D} \times D}. Then, a word-context vector is computed for each sub-region of the image based on its hidden features h (query). Each column of h is a feature vector of a sub-region of the image. For the j-th sub-region, its word-context vector is a dynamic representation of word vectors relevant to h_j, which is calculated by

c_j = \sum_{i=0}^{T-1} \beta_{j,i} e'_i, \quad \text{where} \quad \beta_{j,i} = \frac{\exp(s'_{j,i})}{\sum_{k=0}^{T-1} \exp(s'_{j,k})},    (2)

s'_{j,i} = h_j^T e'_i, and \beta_{j,i} indicates the weight the model attends to the i-th word when generating the j-th sub-region of the image. We then denote the word-context matrix for the image feature set h by F^{attn}(e, h) = (c_0, c_1, ..., c_{N-1}) \in R^{\hat{D} \times N}. Finally, image features and the corresponding word-context features are combined to generate images at the next stage.

To generate realistic images with multiple levels (i.e., sentence level and word level) of conditions, the final objective function of the attentional generative network is defined as

L = L_G + \lambda L_{DAMSM}, \quad \text{where} \quad L_G = \sum_{i=0}^{m-1} L_{G_i}.    (3)

Here, \lambda is a hyperparameter to balance the two terms of Eq. (3). The first term is the GAN loss that jointly approximates conditional and unconditional distributions [32]. At the i-th stage of the AttnGAN, the generator G_i has a corresponding discriminator D_i. The adversarial loss for G_i is defined as

L_{G_i} = \underbrace{-\frac{1}{2} \mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log D_i(\hat{x}_i)]}_{\text{unconditional loss}} \; \underbrace{- \frac{1}{2} \mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log D_i(\hat{x}_i, \bar{e})]}_{\text{conditional loss}},    (4)

where the unconditional loss determines whether the image is real or fake, while the conditional loss determines whether the image and the sentence match or not.

Alternating with the training of G_i, each discriminator D_i is trained to classify the input into the class of real or fake by minimizing the cross-entropy loss defined by

L_{D_i} = \underbrace{-\frac{1}{2} \mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i)] - \frac{1}{2} \mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log(1 - D_i(\hat{x}_i))]}_{\text{unconditional loss}} + \underbrace{-\frac{1}{2} \mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i, \bar{e})] - \frac{1}{2} \mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log(1 - D_i(\hat{x}_i, \bar{e}))]}_{\text{conditional loss}},    (5)

where x_i is from the true image distribution p_{data_i} at the i-th scale, and \hat{x}_i is from the model distribution p_{G_i} at the same scale.
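Before moving on to the DAMSM, the word-level attention F^{attn} of Eq. (2) can be written compactly with batched matrix products. The sketch below uses random tensors and illustrative dimensions; it is not the authors' code, only the computation of Eq. (2).

```python
# Word-level attention of Eq. (2): each sub-region (column of h) attends over the words.
import torch

B, D, D_hat, T, N = 4, 256, 48, 18, 64     # batch, word dim, image dim, #words, #sub-regions (assumed)
e = torch.randn(B, D, T)                   # word features
h = torch.randn(B, D_hat, N)               # image (hidden) features, one column per sub-region
U = torch.randn(D_hat, D)                  # perceptron mapping words into the image feature space

e_prime = torch.einsum('dk,bkt->bdt', U, e)    # e' = U e, shape (B, D_hat, T)
s = torch.bmm(h.transpose(1, 2), e_prime)      # s'_{j,i} = h_j^T e'_i, shape (B, N, T)
beta = torch.softmax(s, dim=2)                 # Eq. (2): softmax over words for each sub-region
c = torch.bmm(e_prime, beta.transpose(1, 2))   # word-context matrix (c_0, ..., c_{N-1}), (B, D_hat, N)
print(c.shape)
```

Each row of beta is the softmax over words for one sub-region, so column j of c is the word-context vector c_j that is combined with h_j to generate image features at the next stage.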
Discriminators of the AttnGAN are structurally disjoint, so they can be trained in parallel and each of them focuses on a single image scale.

The second term of Eq. (3), L_{DAMSM}, is a word-level fine-grained image-text matching loss computed by the DAMSM, which is elaborated in Subsection 3.2.

3.2. Deep Attentional Multimodal Similarity Model

The DAMSM learns two neural networks that map sub-regions of the image and words of the sentence to a common semantic space, and thus measures the image-text similarity at the word level to compute a fine-grained loss for image generation.

The text encoder is a bi-directional Long Short-Term Memory (LSTM) [24] that extracts semantic vectors from the text description. In the bi-directional LSTM, each word corresponds to two hidden states, one for each direction. Thus, we concatenate its two hidden states to represent the semantic meaning of a word. The feature matrix of all words is indicated by e \in R^{D \times T}; its i-th column e_i is the feature vector for the i-th word. D is the dimension of the word vector and T is the number of words. Meanwhile, the last hidden states of the bi-directional LSTM are concatenated to form the global sentence vector, denoted by \bar{e} \in R^D.

The image encoder is a Convolutional Neural Network (CNN) that maps images to semantic vectors. The intermediate layers of the CNN learn local features of different sub-regions of the image, while the later layers learn global features of the image. More specifically, our image encoder is built upon the Inception-v3 model [25] pretrained on ImageNet [22]. We first rescale the input image to 299x299 pixels. We then extract the local feature matrix f \in R^{768 \times 289} (reshaped from 768x17x17) from the "mixed_6e" layer of Inception-v3. Each column of f is the feature vector of a sub-region of the image; 768 is the dimension of the local feature vector, and 289 is the number of sub-regions in the image. Meanwhile, the global feature vector \bar{f} \in R^{2048} is extracted from the last average pooling layer of Inception-v3. Finally, we convert the image features to the common semantic space of the text features by adding a perceptron layer:

v = W f, \quad \bar{v} = \bar{W} \bar{f},    (6)

where v \in R^{D \times 289} and its i-th column v_i is the visual feature vector for the i-th sub-region of the image, and \bar{v} \in R^D is the global vector for the whole image. D is the dimension of the multimodal (i.e., image and text modalities) feature space. For efficiency, all parameters in layers built from the Inception-v3 model are fixed, and the parameters in the newly added layers are jointly learned with the rest of the network.

The attention-driven image-text matching score is designed to measure the matching of an image-sentence pair based on an attention model between the image and the text. We first calculate the similarity matrix for all possible pairs of words in the sentence and sub-regions in the image by

s = e^T v,    (7)

where s \in R^{T \times 289} and s_{i,j} is the dot-product similarity between the i-th word of the sentence and the j-th sub-region of the image. We find that it is beneficial to normalize the similarity matrix as follows:

\bar{s}_{i,j} = \frac{\exp(s_{i,j})}{\sum_{k=0}^{T-1} \exp(s_{k,j})}.    (8)

Then, we build an attention model to compute a region-context vector for each word (query). The region-context vector c_i is a dynamic representation of the image's sub-regions related to the i-th word of the sentence. It is computed as the weighted sum over all regional visual vectors, i.e.,

c_i = \sum_{j=0}^{288} \alpha_j v_j, \quad \text{where} \quad \alpha_j = \frac{\exp(\gamma_1 \bar{s}_{i,j})}{\sum_{k=0}^{288} \exp(\gamma_1 \bar{s}_{i,k})}.    (9)

Here, \gamma_1 is a factor that determines how much attention is paid to features of the relevant sub-regions when computing the region-context vector for a word.

Finally, we define the relevance between the i-th word and the image using the cosine similarity between c_i and e_i, i.e., R(c_i, e_i) = (c_i^T e_i) / (||c_i|| \, ||e_i||). Inspired by the minimum classification error formulation in speech recognition (see, e.g., [11, 8]), the attention-driven image-text matching score between the entire image (Q) and the whole text description (D) is defined as

R(Q, D) = \log \Big( \sum_{i=1}^{T-1} \exp(\gamma_2 R(c_i, e_i)) \Big)^{\frac{1}{\gamma_2}},    (10)

where \gamma_2 is a factor that determines how much to magnify the importance of the most relevant word-to-region-context pair. When \gamma_2 \to \infty, R(Q, D) approximates \max_{i=1}^{T-1} R(c_i, e_i).

The DAMSM loss is designed to learn the attention model in a semi-supervised manner, in which the only supervision is the matching between entire images and whole sentences (a sequence of words). Similar to [4, 9], for a batch of image-sentence pairs \{(Q_i, D_i)\}_{i=1}^{M}, the posterior probability of sentence D_i matching image Q_i is computed as

P(D_i | Q_i) = \frac{\exp(\gamma_3 R(Q_i, D_i))}{\sum_{j=1}^{M} \exp(\gamma_3 R(Q_i, D_j))},    (11)

where \gamma_3 is a smoothing factor determined by experiments. In this batch of sentences, only D_i matches the image Q_i, and we treat all other M - 1 sentences as mismatching descriptions.
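A compact sketch of the attention-driven matching score of Eqs. (7)-(10) and the batch posterior of Eq. (11) is given below. The word and region features are random stand-ins for the encoder outputs of Eq. (6), and the pairwise score matrix is likewise a placeholder; only the tensor operations mirror the equations.

```python
# DAMSM matching score: word/region similarities, region-context vectors, R(Q, D), and P(D|Q).
import torch

B, D, T, R = 8, 256, 15, 289                 # batch, common feature dim, #words, #sub-regions (assumed)
gamma1, gamma2, gamma3 = 5.0, 5.0, 10.0      # hyperparameters as reported in the paper

e = torch.randn(B, D, T)                     # word features in the common space (placeholder)
v = torch.randn(B, D, R)                     # sub-region visual features from Eq. (6) (placeholder)

s = torch.bmm(e.transpose(1, 2), v)                     # Eq. (7): similarity matrix, (B, T, R)
s_bar = torch.softmax(s, dim=1)                         # Eq. (8): normalize over words
alpha = torch.softmax(gamma1 * s_bar, dim=2)            # attention over sub-regions per word
c = torch.bmm(v, alpha.transpose(1, 2))                 # Eq. (9): region-context vectors, (B, D, T)

rel = torch.cosine_similarity(c, e, dim=1)              # R(c_i, e_i) for every word, (B, T)
R_QD = torch.logsumexp(gamma2 * rel, dim=1) / gamma2    # Eq. (10) for matched pairs

# Eq. (11): score every (image Q_i, sentence D_j) pair in the batch, then a softmax
# over sentences gives P(D_i | Q_i); the pairwise scores here are random stand-ins.
scores = torch.randn(B, B)
P_D_given_Q = torch.softmax(gamma3 * scores, dim=1)
print(R_QD.shape, P_D_given_Q.shape)
```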
Following [4, 9], we define the loss function as the negative log posterior probability that the images are matched with their corresponding text descriptions (ground truth), i.e.,

L^w_1 = -\sum_{i=1}^{M} \log P(D_i | Q_i),    (12)

where 'w' stands for "word". Symmetrically, we also minimize

L^w_2 = -\sum_{i=1}^{M} \log P(Q_i | D_i),    (13)

where P(Q_i | D_i) = \frac{\exp(\gamma_3 R(Q_i, D_i))}{\sum_{j=1}^{M} \exp(\gamma_3 R(Q_j, D_i))} is the posterior probability that sentence D_i is matched with its corresponding image Q_i. If we redefine Eq. (10) by R(Q, D) = (\bar{v}^T \bar{e}) / (||\bar{v}|| \, ||\bar{e}||) and substitute it into Eqs. (11), (12) and (13), we can obtain the loss functions L^s_1 and L^s_2 (where 's' stands for "sentence") using the sentence vector \bar{e} and the global image vector \bar{v}.

Finally, the DAMSM loss is defined as

L_{DAMSM} = L^w_1 + L^w_2 + L^s_1 + L^s_2.    (14)

Based on experiments on a held-out validation set, we set the hyperparameters in this section as \gamma_1 = 5, \gamma_2 = 5, \gamma_3 = 10 and M = 50. Our DAMSM is pretrained by minimizing L_{DAMSM} using real image-text pairs. (We also finetuned the DAMSM jointly with the whole network; however, the performance was not improved.) Since the size of images for pretraining the DAMSM is not limited by the size of images that can be generated, real images of size 299x299 are utilized. In addition, the pretrained text encoder in the DAMSM provides visually-discriminative word vectors learned from image-text paired data for the attentional generative network. In comparison, conventional word vectors pretrained on pure text data are often not visually-discriminative; e.g., word vectors of different colors, such as red, blue, yellow, etc., are often clustered together in the vector space, due to the lack of grounding them to the actual visual signals.

In sum, we propose two novel attention models, the attentional generative network and the DAMSM, which play different roles in the AttnGAN. (i) The attention mechanism in the generative network (see Eq. (2)) enables the AttnGAN to automatically select word-level conditions for generating different sub-regions of the image. (ii) With an attention mechanism (see Eq. (9)), the DAMSM is able to compute the fine-grained text-image matching loss L_{DAMSM}. It is worth mentioning that L_{DAMSM} is applied only on the output of the last generator G_{m-1}, because the eventual goal of the AttnGAN is to generate large images by the last generator. We also tried to apply L_{DAMSM} on images of all resolutions generated by (G_0, G_1, ..., G_{m-1}); however, the performance was not improved while the computational cost was increased.

4. Experiments

Extensive experimentation is carried out to evaluate the proposed AttnGAN. We first study the important components of the AttnGAN, including the attentional generative network and the DAMSM. Then, we compare our AttnGAN with previous state-of-the-art GAN models for text-to-image synthesis [31, 32, 20, 18, 16].

Datasets. As in previous text-to-image methods [31, 32, 20, 18], our method is evaluated on the CUB [28] and COCO [14] datasets. We preprocess the CUB dataset according to the method in [31]. Table 1 lists the statistics of the datasets.

Dataset          CUB [28]          COCO [14]
                 train    test     train    test
#samples         8,855    2,933    80k      40k
captions/image   10       10       5        5

Table 1. Statistics of datasets.

Evaluation. Following Zhang et al. [31], we use the inception score [23] as the quantitative evaluation measure. Since the inception score cannot reflect whether the generated image is well conditioned on the given text description, we propose to use R-precision, a common evaluation metric for ranking retrieval results, as a complementary evaluation metric for the text-to-image synthesis task. If there are R relevant documents for a query, we examine the top R ranked retrieval results of a system and find that r of them are relevant; then, by definition, the R-precision is r/R. More specifically, we conduct a retrieval experiment, i.e., we use generated images to query their corresponding text descriptions. First, the image and text encoders learned in our pretrained DAMSM are utilized to extract global feature vectors of the generated images and the given text descriptions. We then compute cosine similarities between the global image vectors and the global text vectors. Finally, we rank the candidate text descriptions for each image in descending similarity and find the top r relevant descriptions for computing the R-precision. To compute the inception score and the R-precision, each model generates 30,000 images from randomly selected unseen text descriptions. The candidate text descriptions for each query image consist of one ground truth (i.e., R = 1) and 99 randomly selected mismatching descriptions.
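A sketch of this R-precision protocol for the R = 1 case used here: each generated image queries its candidate set (one ground-truth caption plus 99 mismatches) by cosine similarity of the global DAMSM features, and we count how often the ground truth is ranked first. The feature extractors are replaced by random vectors and the evaluation-set size below is an arbitrary placeholder, so this is only an illustration of the metric, not the paper's evaluation code.

```python
# R-precision with R = 1: fraction of images whose ground-truth caption is the top-1 retrieval.
import torch

num_images, D = 1000, 256                       # evaluation set size and feature dim (assumed)
img_feat = torch.randn(num_images, D)           # global image features (placeholder for DAMSM image encoder)
txt_feat = torch.randn(num_images, 100, D)      # per image: 100 candidate captions, index 0 = ground truth

img_feat = torch.nn.functional.normalize(img_feat, dim=-1)
txt_feat = torch.nn.functional.normalize(txt_feat, dim=-1)

sims = torch.einsum('nd,ncd->nc', img_feat, txt_feat)   # cosine similarity to each candidate caption
top1 = sims.argmax(dim=1)                               # rank candidates, keep the top-1 (R = 1)
r_precision = (top1 == 0).float().mean()                # how often the ground truth ranks first
print(f"R-precision: {r_precision.item():.2%}")
```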
Besides quantitative evaluation, we also qualitatively examine the samples generated by our models. Specifically, we visualize the intermediate results with the attention learned by the attention models F^{attn}. As defined in Eq. (2), the weights \beta_{j,i} indicate which words the model attends to when generating a sub-region of the image, and \sum_{i=0}^{T-1} \beta_{j,i} = 1. We suppress the less-relevant words for an image's sub-region via

\hat{\beta}_{j,i} = \begin{cases} \beta_{j,i}, & \text{if } \beta_{j,i} > 1/T, \\ 0, & \text{otherwise.} \end{cases}    (15)

For better visualization, we fix the word and compute its attention weights with the N different sub-regions of an image, \hat{\beta}_{0,i}, \hat{\beta}_{1,i}, ..., \hat{\beta}_{N-1,i}. We reshape the N attention weights to \sqrt{N} x \sqrt{N} pixels, which are then upsampled with Gaussian filters to have the same size as the generated images. Limited by the length of the paper, we only visualize the top-5 most attended words (i.e., words with the top-5 highest \sum_{j=0}^{N-1} \hat{\beta}_{j,i} values) for each attention model.

4.1. Component analysis

In this section, we first quantitatively evaluate the AttnGAN and its variants. The results are shown in Table 2 and Figure 3. Our "AttnGAN1" architecture has one attention model and two generators, while the "AttnGAN2" architecture has two attention models stacked with three generators (see Figure 2). In addition, as illustrated in Figure 4, Figure 5, Figure 6, and Figure 7, we qualitatively examine the images generated by our AttnGAN.

[Figure 3. Inception scores and R-precision rates by our AttnGAN and its variants at different epochs on CUB (top) and COCO (bottom) test sets. For the text-to-image synthesis task, R = 1.]

Method                    Inception score   R-precision (%)
AttnGAN1, no DAMSM        3.98 ± .04        10.37 ± 5.88
AttnGAN1, λ = 0.1         4.19 ± .06        16.55 ± 4.83
AttnGAN1, λ = 1           4.35 ± .05        34.96 ± 4.02
AttnGAN1, λ = 5           4.35 ± .04        58.65 ± 5.41
AttnGAN1, λ = 10          4.29 ± .05        63.87 ± 4.85
AttnGAN2, λ = 5           4.36 ± .03        67.82 ± 4.43
AttnGAN2, λ = 50 (COCO)   25.89 ± .47       85.47 ± 3.69

Table 2. The best inception score and the corresponding R-precision rate of each AttnGAN model on CUB (top six rows) and COCO (the last row) test sets. More results are shown in Figure 3.

The DAMSM loss. To test the proposed L_{DAMSM}, we adjust the value of λ (see Eq. (3)). As shown in Figure 3, a larger λ leads to a significantly higher R-precision rate on both the CUB and COCO datasets. On the CUB dataset, when the value of λ is increased from 0.1 to 5, the inception score of the AttnGAN1 is improved from 4.19 to 4.35 and the corresponding R-precision rate is increased from 16.55% to 58.65% (see Table 2). On the COCO dataset, by increasing the value of λ from 0.1 to 50, the AttnGAN1 achieves both a high inception score and a high R-precision rate (see Figure 3). This comparison demonstrates that properly increasing the weight of L_{DAMSM} helps to generate higher quality images that are better conditioned on the given text descriptions. The reason is that the proposed fine-grained image-text matching loss L_{DAMSM} provides additional supervision (i.e., word-level matching information) for training the generator. Moreover, in our experiments we do not observe any collapsed nonsensical mode in the visualization of AttnGAN-generated images. This indicates that, with the extra supervision, the fine-grained image-text matching loss also helps to stabilize the training process of the AttnGAN. In addition, if we replace the proposed DAMSM sub-network with the text encoder used in [19], on the CUB dataset the inception score and R-precision drop to 3.98 and 10.37%, respectively (i.e., the "AttnGAN1, no DAMSM" entry in Table 2), which further demonstrates the effectiveness of the proposed L_{DAMSM}.

The attentional generative network. As shown in Table 2 and Figure 3, stacking two attention models in the generative networks not only generates images of a higher resolution (from 128x128 to 256x256), but also yields higher inception scores on both the CUB and COCO datasets. In order to guarantee the image quality, we find the best value of λ for each dataset by increasing the value of λ until the overall inception score starts to drop on a held-out validation set. "AttnGAN1" models are built for searching the best λ, based on which an "AttnGAN2" model is built to generate higher resolution images. Due to GPU memory constraints, we did not try the AttnGAN with three attention models. As a result, our final model for CUB and COCO is "AttnGAN2, λ=5" and "AttnGAN2, λ=50", respectively. The final λ of the COCO dataset turns out to be much larger than that of the CUB dataset, indicating that the proposed L_{DAMSM} is especially important for generating complex scenarios like those in the COCO dataset.
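The attention maps visualized in Figure 4 below follow the procedure of Eq. (15) above: per-word weights are thresholded at 1/T, reshaped to a \sqrt{N} x \sqrt{N} grid, and upsampled to the generated-image size. In the sketch, bilinear interpolation stands in for the Gaussian filtering used in the paper, and the attention weights are random placeholders, so it only illustrates the visualization steps.

```python
# Attention-map visualization per Eq. (15): threshold, pick top-5 words, reshape, upsample.
import torch
import torch.nn.functional as F

T, N, out_size = 15, 64 * 64, 256                 # #words, #sub-regions, target image size (assumed)
beta = torch.softmax(torch.randn(N, T), dim=1)    # attention weights beta_{j,i}; each row sums to 1

beta_hat = torch.where(beta > 1.0 / T, beta, torch.zeros_like(beta))   # Eq. (15): suppress weak words

word_idx = beta_hat.sum(dim=0).topk(5).indices    # top-5 most attended words
side = int(N ** 0.5)
maps = beta_hat[:, word_idx].T.reshape(-1, 1, side, side)              # one sqrt(N) x sqrt(N) map per word
maps = F.interpolate(maps, size=(out_size, out_size), mode='bilinear', align_corners=False)
print(word_idx.tolist(), maps.shape)              # (5, 1, 256, 256) attention maps
```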
[Figure 4. Intermediate results of our AttnGAN on CUB (top) and COCO (bottom) test sets; example captions include "the bird has a yellow crown and a black eyering that is round" and "a fruit stand display with bananas and kiwi". In each block, the first row gives the 64x64 images by G_0, the 128x128 images by G_1 and the 256x256 images by G_2 of the AttnGAN; the second and third rows show the top-5 most attended words by F^{attn}_1 and F^{attn}_2 of the AttnGAN, respectively. Refer to the supplementary material for more examples.]

To better understand what has been learned by the AttnGAN, we visualize its intermediate results with attention. As shown in Figure 4, the first stage of the AttnGAN (G_0) just sketches the primitive shapes and colors of objects and generates low resolution images. Since only the global sentence vectors are utilized in this stage, the generated images lack details described by exact words, e.g., the beak and eyes of a bird. Based on word vectors, the following stages (G_1 and G_2) learn to rectify defects in the results of the previous stage and add more details to generate higher-resolution images. Some sub-regions/pixels of the G_1 or G_2 images can be inferred directly from images generated by the previous stage. For those sub-regions, the attention is equally allocated to all words and is shown as black in the attention map (see Figure 4).
For other sub-regions, which usually have semantic meaning expressed in the text description, such as the attributes of objects, the attention is allocated to their most relevant words (bright regions in Figure 4). Thus, those regions are inferred from both word-context features and previous image features of those regions. As shown in Figure 4, on the CUB dataset the words "the", "this", "bird" are usually attended by the F^{attn} models for locating the object; the words describing object attributes, such as colors and parts of birds, are also attended for correcting defects and drawing details. On the COCO dataset, we have similar observations. Since there is usually more than one object in each COCO image, it is more visible that the words describing different objects are attended by different sub-regions of the image, e.g., "bananas", "kiwi" in the bottom-right block of Figure 4. These observations demonstrate that the AttnGAN learns to understand the detailed semantic meaning expressed in the text description of an image. Another observation is that our second attention model F^{attn}_2 is able to attend to some new words that were omitted by the first attention model F^{attn}_1 (see Figure 4). This demonstrates that, to provide richer information for generating higher resolution images at the latter stages of the AttnGAN, the corresponding attention models learn to recover objects and attributes omitted at previous stages.

[Figure 5. Example results of our AttnGAN model trained on CUB while changing some of the most attended words in the text descriptions, e.g., "this bird has wings that are black/red/blue and has a white/yellow/red belly".]

[Figure 6. 256x256 images generated from descriptions of novel scenarios using the AttnGAN model trained on COCO, e.g., "a fluffy black cat floating on top of a lake", "a stop sign is flying in the blue sky". (Intermediate results are given in the supplementary material.)]

[Figure 7. Novel images by our AttnGAN on the CUB test set.]

Generalization ability. Our experimental results above have quantitatively and qualitatively shown the generalization ability of the AttnGAN by generating images from unseen text descriptions. Here we further test how sensitive the outputs are to changes in the input sentences by changing some of the most attended words in the text descriptions. Some examples are shown in Figure 5. They illustrate that the generated images are modified according to the changes in the input sentences, showing that the model can catch subtle semantic differences in the text description. Moreover, as shown in Figure 6, our AttnGAN can generate images to reflect the semantic meaning of descriptions of novel scenarios that are not likely to happen in the real world, e.g., a stop sign floating on top of a lake. On the other hand, we also observe that the AttnGAN sometimes generates images which are sharp and detailed but not likely realistic. As the examples in Figure 7 show, the AttnGAN creates birds with multiple heads, eyes or tails, which only exist in fairy tales. This indicates that our current method is still not perfect at capturing globally coherent structures, which leaves room for improvement. To sum up, the observations shown in Figure 5, Figure 6 and Figure 7 further demonstrate the generalization ability of the AttnGAN.

4.2. Comparison with previous methods

We compare our AttnGAN with previous state-of-the-art GAN models for text-to-image generation on the CUB and COCO test sets.

Dataset   GAN-INT-CLS [20]   GAWWN [18]   StackGAN [31]   StackGAN-v2 [32]   PPGN [16]    Our AttnGAN
CUB       2.88 ± .04         3.62 ± .07   3.70 ± .04      3.82 ± .06         /            4.36 ± .03
COCO      7.88 ± .07         /            8.45 ± .03      /                  9.58 ± .21   25.89 ± .47

Table 3. Inception scores by state-of-the-art GAN models [20, 18, 31, 32, 16] and our AttnGAN on CUB and COCO test sets.

As shown in Table 3, on the CUB dataset our AttnGAN achieves a 4.36 inception score, which significantly outperforms the previous best inception score of 3.82. More impressively, our AttnGAN boosts the best reported inception score on the COCO dataset from 9.58 to 25.89, a 170.25% relative improvement. The COCO dataset is known to be much more challenging than the CUB dataset because it consists of images with more complex scenarios, and existing methods struggle to generate realistic high-resolution images on it. Examples in Figure 4 and Figure 6 illustrate that our AttnGAN succeeds in generating 256x256 images for various scenarios on the COCO dataset, although the generated images for COCO are not as photo-realistic as those for CUB. The experimental results show that, compared to previous state-of-the-art approaches, the AttnGAN is more effective for generating complex scenes due to its novel attention mechanism, which catches fine-grained word-level and sub-region-level information in text-to-image generation.
5. Conclusions

In this paper, we propose an Attentional Generative Adversarial Network, named AttnGAN, for fine-grained text-to-image synthesis. First, we build a novel attentional generative network for the AttnGAN to generate high quality images through a multi-stage process. Second, we propose a deep attentional multimodal similarity model to compute the fine-grained image-text matching loss for training the generator of the AttnGAN. Our AttnGAN significantly outperforms previous state-of-the-art GAN models, boosting the best reported inception score by 14.14% on the CUB dataset and 170.25% on the more challenging COCO dataset. Extensive experimental results clearly demonstrate the effectiveness of the proposed attention mechanism in the AttnGAN, which is especially critical for text-to-image generation for complex scenes.

References

[1] A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Parikh, and D. Batra. VQA: Visual question answering. IJCV, 123(1):4-31, 2017.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
[3] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
[4] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.
[5] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. In CVPR, 2017.
[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[7] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015.
[8] X. He, L. Deng, and W. Chou. Discriminative learning in sequential pattern recognition. IEEE Signal Processing Magazine, 25(5):14-36, 2008.
[9] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
[10] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[11] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Transactions on Speech and Audio Processing, 5(3):257-265, 1997.
[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[13] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[15] E. Mansimov, E. Parisotto, L. J. Ba, and R. Salakhutdinov. Generating images from captions with attention. In ICLR, 2016.
[16] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, 2017.
[17] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[18] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.
[19] S. Reed, Z. Akata, B. Schiele, and H. Lee. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016.
[20] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.
[21] S. E. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. de Freitas. Parallel multiscale autoregressive density estimation. In ICML, 2017.
[22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211-252, 2015.
[23] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
[24] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, 45(11):2673-2681, 1997.
[25] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[26] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with PixelCNN decoders. In NIPS, 2016.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv:1706.03762, 2017.
[28] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[29] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[30] Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
[31] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
[32] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. arXiv:1710.10916, 2017.

Supplementary Material

Due to the size limit, more examples are available in the appendix, which can be downloaded from this link.