International Journal of Innovative Research in Advanced Engineering (IJIRAE) ISSN: 2349-2163
Issue 12, Volume 8 (December 2021) https://www.ijirae.com/archives
ATTENTION BASED IMAGE CAPTIONING USING DEEP LEARNING
Tejaswini Nakirekanti*
Dept of CSE, Mahatma Gandhi Institute of Technology,
Hyderabad-500075, Telangana, India
tejaswininakirekanti@gmail.com
Deepika D
Assistant Professor, Dept of CSE, Mahatma Gandhi Institute of Technology,
Hyderabad-500075, Telangana, India
ddeepika_cse@mgit.ac.in
Publication History
Manuscript Reference No: IJIRAE/RS/Vol.08/Issue12/DCAE10087
Research Article | Open Access
Peer-review: Double-blind Peer-reviewed
Article ID: IJIRAE/RS/Vol.08/Issue12/DCAE10087
Received: 20 December 2021
Accepted: 29 December 2021
Published Online: 30 December 2021
Volume 2021 | Article ID DCAE10087
http://www.ijirae.com/volumes/Vol8/iss-12/09.DCAE10087.pdf
Citation: Tejaswini, N., & Deepika, D. (2021). Attention Based Image Captioning using Deep Learning. International Journal of Innovative Research in Advanced Engineering, VIII, 379-387.
doi: https://doi.org/10.26562/ijirae.2021.v0812.009
Editor-in-Chief: Dr. A. Arul Lawrence Selvakumar, Chief Editor, IJIRAE, AM Publications, India
Copyright: © 2021. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract: Image caption generation is the task of deriving a written description of a picture from the objects and actions it portrays. The resulting description should represent both what is in the image and the interactions among the objects. As with other image processing challenges, recreating this ability in an artificial system is difficult, so Deep Learning techniques are adopted to tackle the problem. The main goal of this project is to build an image captioning model that provides a concise and relatable description of a picture by incorporating an attention mechanism. The model can be separated into two components, an encoder and a decoder. The encoder used here is a pre-trained Google InceptionV3 model, and the decoder is a language model, a Gated Recurrent Unit (GRU), that translates the features and objects provided by the encoder into natural sentences. The attention mechanism further improves the model's effectiveness. The model is trained on a subset of the MS COCO dataset, which contains about 80,000 pictures with at least five descriptions each, and is evaluated with the BLEU score. The model weights are tuned with the categorical cross-entropy loss function to reduce the loss. The results of the model are promising and competitive.
Keywords: CNN-RNN Architecture, Inception V3, Attention-mechanism, Gated recurrent unit (GRU), BLEU-metric.
I. INTRODUCTION
Unlike humans, a computer cannot tell what is in a picture. We human beings are able to differentiate actions in the world based on our knowledge. For a computer, analysing and differentiating the objects and actions in an image is hard but not impossible. Advances in deep learning have led to models that can generate captions for an image, given a large collection of data and enough computation power. This helps give meaning to huge amounts of data and also shapes future research. Many applications can benefit from a model that describes images automatically: enhancing the accuracy of search engines, self-driving automobiles, identification, and other vision applications are just a few examples.
Recent advances in image captioning rely on deep learning. Usually an encoder-decoder framework is employed, where a CNN serves as the encoder that extracts information from the image and an RNN serves as the decoder that converts this representation into a natural language description. The CNN, however, must condense all of the input information into one representation before it is sent to the decoder, and compressing the input this way can lose information. The CNN-RNN image captioning architecture therefore includes an attention mechanism that focuses on one part of the image while generating each word. Visual attention reduces the impact of irrelevant regions when generating the single word that is appropriate to the current context.
A. Problem Definition
Human participation is usually required to annotate images, making manual annotation a near-impossible process for large commercial databases. Deep learning can aid in annotating such images automatically. Image captioning aims to create a logical and holistic explanation of an image's content. The project's purpose is to create an image captioning model that uses attention to construct a description from an input image.
B. Existing System
Image captioning has grown rapidly in recent years beyond retrieval-based and template-based methods. Retrieval-based methods use a distance metric to find images whose captions are similar to the query image and combine the retrieved captions to generate a new caption; as a consequence, this approach cannot really produce novel descriptions. Template-based methods maintain a collection of sentence templates matching the scenario in the image and fill a template with the particular objects or actions present in the target image. The main drawback of this technique is that it cannot generate variable-length captions, even though the outputs are grammatically correct.
C. Proposed System
The proposed model employs a CNN to encode a picture into a representation that captures the nuances of the image. The attention layer subsequently uses this representation to build a context vector. The RNN then decodes the information produced by the attention model into a sequential, comprehensible statement.
II. LITERATURE SURVEY
Research on various techniques for image caption generation is discussed in the following paragraphs. Various researchers have employed different mechanisms for captioning images and for finding the most useful features.
Genc Hoxha et al. proposed an image captioning framework that combines generation-based and retrieval-based captioning. Multiple descriptions are generated by combining a CNN-RNN architecture with beam search: instead of generating a single description, beam search produces several candidates. The experiments are performed on the RSICD dataset. Using a more advanced lexical-similarity measure could further improve the similarity between the generated and reference descriptions. [1]
Ansar Hani et al. put forth an encoder-decoder structure for an attention-based picture captioning model. The attention mechanism focuses on certain parts of a visual rather than the complete picture. The encoder is a pre-trained Google InceptionV3 model, and the decoder is a Gated Recurrent Unit (GRU): the encoder uses the CNN to extract the image's features, while the decoder derives relevant descriptions. The model was tested on the MSCOCO dataset and the outcomes reported; it is optimized with the Adam adaptive gradient optimizer. [2]
Soheyla Amirian et al. created a GAN network comprising a Descriptor and a Generator. A Variational Autoencoder (VAE) is employed to map the input to a distribution; autoencoders are neural networks that learn data encodings without supervision. Rather than relying on predefined templates, the model adds sentiment and diversity to the captions to generate more human-like descriptions. The model is evaluated on the MSCOCO dataset. [3]
Niange Yu et al. introduced an image-captioning architecture that constructs descriptions based on a given topic. A multi-label classifier selects the topics in a given image, and the model takes an image-topic pair as input and yields a natural sentence for the image as output. [4]
Fang Fang et al. developed a word-level attention layer that processes visual attributes using two modules for reliable word prediction. The first is a bidirectional spatial-embedding module for handling region proposals; the second extracts word-level attention, which is then fed into the language model through an attention approach. The language model is a recurrent neural network, the Long Short-Term Memory (LSTM), and the visual features are extracted from the picture by a convolutional neural network (CNN). [5]
III. DESIGN METHODOLOGY
The purpose of image caption generation is to provide a detailed and comprehensive description of the image's content. When constructing the sentence, the model evaluates not just the objects in the image but also their relationships. An attention-oriented CNN-RNN framework is used to accomplish this. Creating captions for images involves three parts:
• Model Architecture
• Data Collection and Preprocessing
• Model Building
A. Model Architecture
Fig. 1 depicts the complete system framework. The dataset used is the MSCOCO dataset, which contains images along with five descriptions per picture [6]. Pre-processing then builds a vocabulary from the dataset captions and resizes the images. Model building comprises the encoder-decoder framework together with the attention mechanism, whose goal is to boost the model's efficiency.
The Google InceptionV3 model is used as the encoder to extract the nuances of an image, and an RNN, the Gated Recurrent Unit (GRU), generates meaningful sequences. Model quality is evaluated with the BLEU metric.
Fig. 1 Flowchart of model architecture.
B. Data Collection and Preprocessing
Although various resources exist, the MSCOCO dataset, which is very popular and known for its large-scale object detection tasks, is chosen to train the model. The dataset comprises photos of daily activities along with their captions: over 80,000 photos, each accompanied by at least five user-generated sentences describing the depicted scenario. Fig. 2 shows a picture from the dataset together with its five reference descriptions.
Fig. 2 A snapshot of an image from dataset.
1). Data Preparation and Preprocessing: The dataset has to be converted into a layout suitable for training the model. As the dataset contains many kinds of images and descriptions, both need preparation and pre-processing before being fed into the model.
2). Preprocessing Images: The dataset contains images of different dimensions, so every image has to be scaled to a shape the model encoder accepts. The encoder in this project accepts images of shape 299x299x3, and each image must be resized to that shape before being fed to the model.
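As an illustration, this resizing step might look like the following TensorFlow sketch (the helper name load_image is ours, not from the paper):

import tensorflow as tf

def load_image(image_path):
    """Read an image file, resize it to 299x299, and apply the
    InceptionV3-specific preprocessing (scales pixels to [-1, 1])."""
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path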
3). Preprocessing Descriptions: The descriptions in the dataset are English sentences, from which a vocabulary is built based on word frequency. Pre-processing splits the sentences into words and removes punctuation, and the 5,000 most frequent words are then kept as the vocabulary. Machine learning models only accept vectors (arrays of numbers) as input, so the textual input, which is just strings, has to be transformed into numbers (the text must be "vectorized") before being fed to the model. For this, word_index and index_word dictionaries are maintained, as shown in Fig. 3 and Fig. 4, where each word is associated with a unique number and each number with a unique word [8]. These dictionaries convert sentences to sequences of numbers and back, and the sequences are fed into the network during training, testing, evaluation, and prediction.
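A minimal sketch of this vocabulary step with the Keras tokenizer (train_captions is a hypothetical list holding the caption strings):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Keep the 5,000 most frequent words; rarer words map to an <unk> token.
tokenizer = Tokenizer(num_words=5000, oov_token='<unk>')
tokenizer.fit_on_texts(train_captions)
# word_index maps word -> id; index_word maps id -> word (Fig. 3 and Fig. 4).
word_index, index_word = tokenizer.word_index, tokenizer.index_word

# Vectorize: each caption becomes a padded sequence of integers.
sequences = tokenizer.texts_to_sequences(train_captions)
caption_vectors = pad_sequences(sequences, padding='post')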
4). Divide data into train and test sets: The purpose of dividing the data is straightforward: to prevent the model from overfitting and to evaluate it appropriately, the data must be separated into train, validation, and test splits.
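For example, with scikit-learn (the 80/20 ratio here is an assumption, as the paper does not state its exact split; a validation set can be carved out of the training portion the same way):

from sklearn.model_selection import train_test_split

# image_paths and caption_vectors are aligned one-to-one (one row per caption).
img_train, img_test, cap_train, cap_test = train_test_split(
    image_paths, caption_vectors, test_size=0.2, random_state=0)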
Fig. 3 Snapshot of the word_index dictionary. Fig. 4 Snapshot of the index_word dictionary.
C. Model Building
An attention-based CNN-RNN architecture is employed to create the framework. The proposed model's architecture is depicted in Fig. 5.
Fig. 5 The model architecture.
1) Encoder: The model's encoder is a pre-trained convolutional neural network, Google InceptionV3, which produces a representation of the input image. This model placed second in the ILSVRC2015 image classification competition [9]. Its last layer, used for classification, is removed, and the features are extracted from the pre-processed images with the InceptionV3 model. Fig. 6 shows the 42-layer structural design of the InceptionV3 model. Inception-v3 introduced a novel inception module that combines convolutional filters of several different sizes into a single block; this design reduces both the number of parameters that must be trained and the computational complexity.
The encoder accepts images of shape 299x299x3 and produces an output of shape 8x8x2048. In the original network, a fully connected layer at the end converts this into a one-dimensional array of size 1001 used for classification. The attention mechanism works by focusing on certain parts of the image while generating sequences rather than considering the full image, and a fully connected representation, being one-dimensional, does not capture the exact locations in the image needed for attention.
Fig. 6 Architecture of InceptionV3 model.
To achieve the attention mechanism, the last layer of the architecture is removed and the hidden layer with output shape 8x8x2048 is treated as the output layer. This output retains the location information of the image, making it easy to focus on particular regions while generating sequences [10]. Consider a CNN as in Fig. 7, whose input is an image and whose output is a one-dimensional vector of size 1000; this vector may not exactly capture the locations. By contrast, in Fig. 8 the output is taken from a convolution layer, which captures the locations in the original image.
Fig. 7 A convolutional layer with 1-dimensional output.
Fig. 8 A convolutional layer with 3-dimensional output capturing the locations of the image.
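A sketch of how such a feature extractor can be built in Keras; this reflects the standard include_top=False usage rather than necessarily the authors' exact code:

import tensorflow as tf

# InceptionV3 without its classification head exposes the last 8x8x2048
# convolutional feature map instead of the 1D class-score vector.
base = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
feature_extractor = tf.keras.Model(base.input, base.output)

image_batch = tf.random.uniform((1, 299, 299, 3))   # stand-in preprocessed image
features = feature_extractor(image_batch)           # shape (1, 8, 8, 2048)
# Flatten the 8x8 grid into 64 spatial locations for the attention layer.
features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))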
2) Attention-Mechanism: When creating descriptions for images, local attention concentrates on particular areas of the given image. Considering the hidden state at time t, the decoder attends to specific parts of the image and constructs a context vector from the image features produced by the CNN's convolution layer. Feeding the entire image into the RNN to produce sequences is not always feasible and may result in inappropriate captions. So, instead of the CNN's fully connected layer, the output of one of the convolutional layers, which contains spatial information about the image, is used as input to the attention model.
The encoder output of shape 8x8x2048 is reshaped to 64x2048 and given as input. The attention mechanism automatically reduces the weights of insignificant regions, eliminating the influence of non-relevant features. The attention weights are given as input to the language model, which uses them while generating sequences; as a result, these weights direct the model to generate relevant phrases that focus on the image's objects.
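One standard realisation of such a layer is Bahdanau-style additive attention; the sketch below assumes this form, since the paper does not spell out its exact scoring function:

import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects image features
        self.W2 = tf.keras.layers.Dense(units)   # projects decoder state
        self.V = tf.keras.layers.Dense(1)        # scalar score per location

    def call(self, features, hidden):
        # features: (batch, 64, 2048); hidden: (batch, units)
        hidden_t = tf.expand_dims(hidden, 1)
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_t)))
        attention_weights = tf.nn.softmax(score, axis=1)   # (batch, 64, 1)
        # Weighted sum over the 64 locations gives the context vector.
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights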
Fig. 9 depicts how the attention mechanism contributes to generating captions from a given image. The model analyses and highlights only significant aspects of the image: first the section containing the man is taken into consideration, then the section around it containing the snow mountain. The caption generated here is "A person riding skis on a mountain".
Fig. 9 Working of attention mechanism.
3) Decoder: The decoder is a language model that converts the features extracted from an image into textual descriptions. A GRU is preferred because of its structure and because it does not suffer from the vanishing-gradient problem. The attention layer produces a one-dimensional vector that is fed to the GRU. The GRU is made up of two gates: an update gate and a reset gate [11]. The update gate tells the model how much of the previous information has to be kept, i.e., passed on to the future; the reset gate determines how much of the past information should be forgotten. At each step, the weights are adjusted according to these gates. Fig. 10 depicts the architecture of the GRU, which generates sequences based on experience.
Fig. 10 Architecture of Gated Recurrent Unit (GRU).
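A sketch of a GRU decoder consistent with the description above, reusing the BahdanauAttention class sketched earlier (layer sizes are illustrative, not taken from the paper):

import tensorflow as tf

class Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True,
                                       return_state=True)
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)   # scores over vocabulary
        self.attention = BahdanauAttention(units)

    def call(self, x, features, hidden):
        # Attend over the 64 image locations given the previous GRU state.
        context_vector, attention_weights = self.attention(features, hidden)
        x = self.embedding(x)                          # embed the previous word
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        return self.fc2(x), state, attention_weights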
IV. TESTING AND RESULTS
TensorFlow 2.0 is used to create the model. The supplied photos are resized to 299 by 299 pixels, and the last convolutional layer's 8x8x2048 activation maps are used as annotations. The loss plot is shown in Fig. 11; the loss function is used to tune the model weights so as to reduce the loss.
Fig. 11 Loss Plot.
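The paper names categorical cross-entropy as the loss; with integer word indices this becomes the sparse variant, and padded positions are typically masked out, as in this sketch:

import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # Ignore padded positions (word id 0) so they do not contribute to the loss.
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    loss_ *= tf.cast(mask, dtype=loss_.dtype)
    return tf.reduce_mean(loss_)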
A. Model Evaluation
The BLEU metric [7] is used to evaluate the model. BLEU (Bilingual Evaluation Understudy) is a metric for comparing a generated text to one or more reference texts. It was designed for machine translation, but it can equally compare the text generated by the model with the reference, or real, captions.
The score ranges from 0 to 1 and is computed from overlapping n-grams, so both word choice and local word order influence the result. Fig. 12 shows the predicted caption and BLEU score for an image.
Fig. 12 A picture and its result from the test set.
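A score like the one in Fig. 12 can be computed with NLTK; real_captions and predicted_caption below are hypothetical variables holding the five reference strings and the model output:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [caption.split() for caption in real_captions]  # tokenized references
candidate = predicted_caption.split()                        # tokenized prediction

# Smoothing avoids zero scores when some higher-order n-grams never match.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f'BLEU: {score:.3f}')   # a value between 0 and 1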
B. Results
Some results of the model are given below; the model captions the given image based on the items in the picture and the interactions among them. Fig. 13 shows a prediction from the test set with the real caption, the predicted caption, the BLEU score, and the attention plot for the given image; the generated caption is appropriate to the content of the image.
Fig. 13 Result of a picture from the test set.
As the model is trained with only 10,000 images from the dataset, it can be expected to generate inappropriate captions occasionally, as shown in Fig. 14.
Fig. 14 Result of a picture from the test set.
Finally, consider the captions the model generates for pictures outside the train and test sets. Fig. 15 shows an appropriate caption generated for the given image, while Fig. 16 shows a predicted caption that is quite inappropriate.
Fig. 15 Appropriate caption for the given image.
Fig. 16 Inappropriate caption for the given image.
V. CONCLUSION AND FUTURE WORK
A. Conclusion
An image captioning model that integrates attention is described for constructing a natural sentence from a given image. The characteristics of an image are extracted by a convolutional neural network and given to an attention layer that generates a context vector, which a recurrent neural network then decodes into understandable sequences. The attention mechanism aids both in optimizing the results and in creating appropriate captions. The model is trained on a subset of the MS COCO dataset, and increasing the number of training samples should improve its performance; the model can also be trained on a variety of datasets. It is assessed with the BLEU metric, which considers translation length, word choice, and word order to judge how similar the machine-generated text is to the human reference. Compared to previous research, the outcomes achieved are promising and competitive.
B. Future Work
For future work, a more compact deep convolutional network may enhance the efficiency of the proposed model. The recommended model generates only one caption, but with beam search it could generate more than one caption per image, as sketched below. The proposed paradigm could also be extended to video captioning, that is, summarizing a video's information in sequences.
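For illustration only, a generic beam-search loop might look like the following, where step_fn is a hypothetical callback returning (log-probability, word-id) options for the next word given a partial caption; this is an outline, not the paper's implementation:

import heapq

def beam_search(step_fn, start_id, end_id, beam_width=3, max_len=20):
    beams = [(0.0, [start_id])]   # (cumulative log-prob, partial caption)
    completed = []
    for _ in range(max_len):
        candidates = []
        for log_p, seq in beams:
            for tok_lp, tok in step_fn(seq):
                new_seq = seq + [tok]
                if tok == end_id:             # caption finished
                    completed.append((log_p + tok_lp, new_seq))
                else:
                    candidates.append((log_p + tok_lp, new_seq))
        if not candidates:
            break
        # Keep only the beam_width most probable partial captions.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return heapq.nlargest(beam_width, completed or beams, key=lambda c: c[0])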
REFERENCES
1. Farid Melgani, Genc Hoxha, Jacopo Slaghenauffi. (2020). A New CNN-RNN Framework For Remote Sensing Image Captioning. 2020 Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS).
2. Ansar Hani, Najiba Tagougui, Monji Kherallah. (2019). Image Caption Generation Using A Deep Architecture. 2019 International Arab Conference on Information Technology (ACIT).
3. Soheyla Amirian, Khaled Rasheed, Thiab R. Taha, Hamid R. Arabnia. (2019). Image Captioning with Generative Adversarial Network. 2019 International Conference on Computational Science and Computational Intelligence (CSCI).
4. Niange Yu, Xiaolin Hu, Binheng Song, Jian Yang, Jianwei Zhang. (2019). Topic-Oriented Image Captioning Based on Order-Embedding. IEEE Transactions on Image Processing, 28(6).
5. Fang Fang, Hanli Wang, Pengjie Tang. (2018). Image Captioning with Word Level Attention. 2018 25th IEEE International Conference on Image Processing (ICIP).
6. MS COCO dataset. Available: https://cocodataset.org/#home.
7. Jason Brownlee. A Gentle Introduction to Calculating the BLEU Score for Text in Python. Deep Learning for Natural Language Processing, November 20, 2017.
8. Rishi Sidhu. Turn Words into Numbers - The NLP Way. Sep 1, 2020.
9. Sik-Ho Tsang. Inception-v3 — 1st Runner Up (Image Classification) in ILSVRC 2015. Sep 10, 2018.
10. NPTEL-NOC IITM. "Deep Learning (CS7015): Lec 15.4 Attention over Images," YouTube, Oct 24, 2018 [Video file]. Available: https://www.youtube.com/watch?v=hvhqHhrP_AU&list=PLyqSpQzTE6M9gCgajvQbc68Hk_JKGBAYT&index=119. [Accessed: Apr 20, 2021].
11. Simeon Kostadinov. Understanding GRU Networks. Dec 16, 2017.
Ad

More Related Content

Similar to ATTENTION BASED IMAGE CAPTIONING USING DEEP LEARNING (20)

An Intelligent Approach for Effective Retrieval of Content from Large Data Se...
An Intelligent Approach for Effective Retrieval of Content from Large Data Se...An Intelligent Approach for Effective Retrieval of Content from Large Data Se...
An Intelligent Approach for Effective Retrieval of Content from Large Data Se...
IJCSIS Research Publications
 
IMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IMAGE CAPTIONING USING TRANSFORMER: VISIONAIDIMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IRJET Journal
 
Image captioning using DL and NLP.pptx
Image captioning using DL and NLP.pptxImage captioning using DL and NLP.pptx
Image captioning using DL and NLP.pptx
MrUnknown820784
 
Optimal text-to-image synthesis model for generating portrait images using ge...
Optimal text-to-image synthesis model for generating portrait images using ge...Optimal text-to-image synthesis model for generating portrait images using ge...
Optimal text-to-image synthesis model for generating portrait images using ge...
nooriasukmaningtyas
 
Image super resolution using Generative Adversarial Network.
Image super resolution using Generative Adversarial Network.Image super resolution using Generative Adversarial Network.
Image super resolution using Generative Adversarial Network.
IRJET Journal
 
IRJET - Deep Learning Approach to Inpainting and Outpainting System
IRJET -  	  Deep Learning Approach to Inpainting and Outpainting SystemIRJET -  	  Deep Learning Approach to Inpainting and Outpainting System
IRJET - Deep Learning Approach to Inpainting and Outpainting System
IRJET Journal
 
Methodology for eliminating plain regions from captured images
Methodology for eliminating plain regions from captured imagesMethodology for eliminating plain regions from captured images
Methodology for eliminating plain regions from captured images
IAESIJAI
 
Overview of Video Concept Detection using (CNN) Convolutional Neural Network
Overview of Video Concept Detection using (CNN) Convolutional Neural NetworkOverview of Video Concept Detection using (CNN) Convolutional Neural Network
Overview of Video Concept Detection using (CNN) Convolutional Neural Network
IRJET Journal
 
IRJET- Visual Information Narrator using Neural Network
IRJET- Visual Information Narrator using Neural NetworkIRJET- Visual Information Narrator using Neural Network
IRJET- Visual Information Narrator using Neural Network
IRJET Journal
 
Cartoonization of images using machine Learning
Cartoonization of images using machine LearningCartoonization of images using machine Learning
Cartoonization of images using machine Learning
IRJET Journal
 
A Graph-based Web Image Annotation for Large Scale Image Retrieval
A Graph-based Web Image Annotation for Large Scale Image RetrievalA Graph-based Web Image Annotation for Large Scale Image Retrieval
A Graph-based Web Image Annotation for Large Scale Image Retrieval
IRJET Journal
 
A deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicleA deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicle
IAESIJAI
 
Paper id 25201491
Paper id 25201491Paper id 25201491
Paper id 25201491
IJRAT
 
IRJET - Content based Image Classification
IRJET -  	  Content based Image ClassificationIRJET -  	  Content based Image Classification
IRJET - Content based Image Classification
IRJET Journal
 
IRJET- Saliency based Image Co-Segmentation
IRJET- Saliency based Image Co-SegmentationIRJET- Saliency based Image Co-Segmentation
IRJET- Saliency based Image Co-Segmentation
IRJET Journal
 
IRJET-MText Extraction from Images using Convolutional Neural Network
IRJET-MText Extraction from Images using Convolutional Neural NetworkIRJET-MText Extraction from Images using Convolutional Neural Network
IRJET-MText Extraction from Images using Convolutional Neural Network
IRJET Journal
 
IRJET- Image Caption Generation System using Neural Network with Attention Me...
IRJET- Image Caption Generation System using Neural Network with Attention Me...IRJET- Image Caption Generation System using Neural Network with Attention Me...
IRJET- Image Caption Generation System using Neural Network with Attention Me...
IRJET Journal
 
Machine learning based augmented reality for improved learning application th...
Machine learning based augmented reality for improved learning application th...Machine learning based augmented reality for improved learning application th...
Machine learning based augmented reality for improved learning application th...
IJECEIAES
 
Novel Hybrid Approach to Visual Concept Detection Using Image Annotation
Novel Hybrid Approach to Visual Concept Detection Using Image AnnotationNovel Hybrid Approach to Visual Concept Detection Using Image Annotation
Novel Hybrid Approach to Visual Concept Detection Using Image Annotation
CSCJournals
 
Dc31472476
Dc31472476Dc31472476
Dc31472476
IJMER
 
An Intelligent Approach for Effective Retrieval of Content from Large Data Se...
An Intelligent Approach for Effective Retrieval of Content from Large Data Se...An Intelligent Approach for Effective Retrieval of Content from Large Data Se...
An Intelligent Approach for Effective Retrieval of Content from Large Data Se...
IJCSIS Research Publications
 
IMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IMAGE CAPTIONING USING TRANSFORMER: VISIONAIDIMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IRJET Journal
 
Image captioning using DL and NLP.pptx
Image captioning using DL and NLP.pptxImage captioning using DL and NLP.pptx
Image captioning using DL and NLP.pptx
MrUnknown820784
 
Optimal text-to-image synthesis model for generating portrait images using ge...
Optimal text-to-image synthesis model for generating portrait images using ge...Optimal text-to-image synthesis model for generating portrait images using ge...
Optimal text-to-image synthesis model for generating portrait images using ge...
nooriasukmaningtyas
 
Image super resolution using Generative Adversarial Network.
Image super resolution using Generative Adversarial Network.Image super resolution using Generative Adversarial Network.
Image super resolution using Generative Adversarial Network.
IRJET Journal
 
IRJET - Deep Learning Approach to Inpainting and Outpainting System
IRJET -  	  Deep Learning Approach to Inpainting and Outpainting SystemIRJET -  	  Deep Learning Approach to Inpainting and Outpainting System
IRJET - Deep Learning Approach to Inpainting and Outpainting System
IRJET Journal
 
Methodology for eliminating plain regions from captured images
Methodology for eliminating plain regions from captured imagesMethodology for eliminating plain regions from captured images
Methodology for eliminating plain regions from captured images
IAESIJAI
 
Overview of Video Concept Detection using (CNN) Convolutional Neural Network
Overview of Video Concept Detection using (CNN) Convolutional Neural NetworkOverview of Video Concept Detection using (CNN) Convolutional Neural Network
Overview of Video Concept Detection using (CNN) Convolutional Neural Network
IRJET Journal
 
IRJET- Visual Information Narrator using Neural Network
IRJET- Visual Information Narrator using Neural NetworkIRJET- Visual Information Narrator using Neural Network
IRJET- Visual Information Narrator using Neural Network
IRJET Journal
 
Cartoonization of images using machine Learning
Cartoonization of images using machine LearningCartoonization of images using machine Learning
Cartoonization of images using machine Learning
IRJET Journal
 
A Graph-based Web Image Annotation for Large Scale Image Retrieval
A Graph-based Web Image Annotation for Large Scale Image RetrievalA Graph-based Web Image Annotation for Large Scale Image Retrieval
A Graph-based Web Image Annotation for Large Scale Image Retrieval
IRJET Journal
 
A deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicleA deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicle
IAESIJAI
 
Paper id 25201491
Paper id 25201491Paper id 25201491
Paper id 25201491
IJRAT
 
IRJET - Content based Image Classification
IRJET -  	  Content based Image ClassificationIRJET -  	  Content based Image Classification
IRJET - Content based Image Classification
IRJET Journal
 
IRJET- Saliency based Image Co-Segmentation
IRJET- Saliency based Image Co-SegmentationIRJET- Saliency based Image Co-Segmentation
IRJET- Saliency based Image Co-Segmentation
IRJET Journal
 
IRJET-MText Extraction from Images using Convolutional Neural Network
IRJET-MText Extraction from Images using Convolutional Neural NetworkIRJET-MText Extraction from Images using Convolutional Neural Network
IRJET-MText Extraction from Images using Convolutional Neural Network
IRJET Journal
 
IRJET- Image Caption Generation System using Neural Network with Attention Me...
IRJET- Image Caption Generation System using Neural Network with Attention Me...IRJET- Image Caption Generation System using Neural Network with Attention Me...
IRJET- Image Caption Generation System using Neural Network with Attention Me...
IRJET Journal
 
Machine learning based augmented reality for improved learning application th...
Machine learning based augmented reality for improved learning application th...Machine learning based augmented reality for improved learning application th...
Machine learning based augmented reality for improved learning application th...
IJECEIAES
 
Novel Hybrid Approach to Visual Concept Detection Using Image Annotation
Novel Hybrid Approach to Visual Concept Detection Using Image AnnotationNovel Hybrid Approach to Visual Concept Detection Using Image Annotation
Novel Hybrid Approach to Visual Concept Detection Using Image Annotation
CSCJournals
 
Dc31472476
Dc31472476Dc31472476
Dc31472476
IJMER
 

More from Nathan Mathis (20)

Page Borders Design, Border Design, Baby Clip Art, Fre
Page Borders Design, Border Design, Baby Clip Art, FrePage Borders Design, Border Design, Baby Clip Art, Fre
Page Borders Design, Border Design, Baby Clip Art, Fre
Nathan Mathis
 
How To Write Your Essays In Less Minutes Using This Website Doy News
How To Write Your Essays In Less Minutes Using This Website Doy NewsHow To Write Your Essays In Less Minutes Using This Website Doy News
How To Write Your Essays In Less Minutes Using This Website Doy News
Nathan Mathis
 
Lined Paper For Beginning Writers Writing Paper Prin
Lined Paper For Beginning Writers Writing Paper PrinLined Paper For Beginning Writers Writing Paper Prin
Lined Paper For Beginning Writers Writing Paper Prin
Nathan Mathis
 
Term Paper Example Telegraph
Term Paper Example TelegraphTerm Paper Example Telegraph
Term Paper Example Telegraph
Nathan Mathis
 
Unusual How To Start Off A Compare And Contrast Essay
Unusual How To Start Off A Compare And Contrast EssayUnusual How To Start Off A Compare And Contrast Essay
Unusual How To Start Off A Compare And Contrast Essay
Nathan Mathis
 
How To Write A Methodology Essay, Essay Writer, Essa
How To Write A Methodology Essay, Essay Writer, EssaHow To Write A Methodology Essay, Essay Writer, Essa
How To Write A Methodology Essay, Essay Writer, Essa
Nathan Mathis
 
Recolectar 144 Imagem Educational Background Ex
Recolectar 144 Imagem Educational Background ExRecolectar 144 Imagem Educational Background Ex
Recolectar 144 Imagem Educational Background Ex
Nathan Mathis
 
Microsoft Word Lined Paper Template
Microsoft Word Lined Paper TemplateMicrosoft Word Lined Paper Template
Microsoft Word Lined Paper Template
Nathan Mathis
 
Owl Writing Paper
Owl Writing PaperOwl Writing Paper
Owl Writing Paper
Nathan Mathis
 
The Essay Writing Process Essays
The Essay Writing Process EssaysThe Essay Writing Process Essays
The Essay Writing Process Essays
Nathan Mathis
 
How To Make A Cover Page For Assignment Guide - As
How To Make A Cover Page For Assignment Guide - AsHow To Make A Cover Page For Assignment Guide - As
How To Make A Cover Page For Assignment Guide - As
Nathan Mathis
 
Awesome Creative Writing Essays Thatsnotus
Awesome Creative Writing Essays ThatsnotusAwesome Creative Writing Essays Thatsnotus
Awesome Creative Writing Essays Thatsnotus
Nathan Mathis
 
Sites That Write Papers For You. Websites That Write Essays For You
Sites That Write Papers For You. Websites That Write Essays For YouSites That Write Papers For You. Websites That Write Essays For You
Sites That Write Papers For You. Websites That Write Essays For You
Nathan Mathis
 
4.4 How To Organize And Arrange - Hu
4.4 How To Organize And Arrange - Hu4.4 How To Organize And Arrange - Hu
4.4 How To Organize And Arrange - Hu
Nathan Mathis
 
Essay Written In First Person
Essay Written In First PersonEssay Written In First Person
Essay Written In First Person
Nathan Mathis
 
My Purpose In Life Free Essay Example
My Purpose In Life Free Essay ExampleMy Purpose In Life Free Essay Example
My Purpose In Life Free Essay Example
Nathan Mathis
 
The Structure Of An Outline For A Research Paper, Including Text
The Structure Of An Outline For A Research Paper, Including TextThe Structure Of An Outline For A Research Paper, Including Text
The Structure Of An Outline For A Research Paper, Including Text
Nathan Mathis
 
What Are Some Topics For Exemplification Essays - Quora
What Are Some Topics For Exemplification Essays - QuoraWhat Are Some Topics For Exemplification Essays - Quora
What Are Some Topics For Exemplification Essays - Quora
Nathan Mathis
 
Please Comment, Like, Or Re-Pin For Later Bibliogra
Please Comment, Like, Or Re-Pin For Later BibliograPlease Comment, Like, Or Re-Pin For Later Bibliogra
Please Comment, Like, Or Re-Pin For Later Bibliogra
Nathan Mathis
 
Ide Populer Word In English, Top
Ide Populer Word In English, TopIde Populer Word In English, Top
Ide Populer Word In English, Top
Nathan Mathis
 
Page Borders Design, Border Design, Baby Clip Art, Fre
Page Borders Design, Border Design, Baby Clip Art, FrePage Borders Design, Border Design, Baby Clip Art, Fre
Page Borders Design, Border Design, Baby Clip Art, Fre
Nathan Mathis
 
How To Write Your Essays In Less Minutes Using This Website Doy News
How To Write Your Essays In Less Minutes Using This Website Doy NewsHow To Write Your Essays In Less Minutes Using This Website Doy News
How To Write Your Essays In Less Minutes Using This Website Doy News
Nathan Mathis
 
Lined Paper For Beginning Writers Writing Paper Prin
Lined Paper For Beginning Writers Writing Paper PrinLined Paper For Beginning Writers Writing Paper Prin
Lined Paper For Beginning Writers Writing Paper Prin
Nathan Mathis
 
Term Paper Example Telegraph
Term Paper Example TelegraphTerm Paper Example Telegraph
Term Paper Example Telegraph
Nathan Mathis
 
Unusual How To Start Off A Compare And Contrast Essay
Unusual How To Start Off A Compare And Contrast EssayUnusual How To Start Off A Compare And Contrast Essay
Unusual How To Start Off A Compare And Contrast Essay
Nathan Mathis
 
How To Write A Methodology Essay, Essay Writer, Essa
How To Write A Methodology Essay, Essay Writer, EssaHow To Write A Methodology Essay, Essay Writer, Essa
How To Write A Methodology Essay, Essay Writer, Essa
Nathan Mathis
 
Recolectar 144 Imagem Educational Background Ex
Recolectar 144 Imagem Educational Background ExRecolectar 144 Imagem Educational Background Ex
Recolectar 144 Imagem Educational Background Ex
Nathan Mathis
 
Microsoft Word Lined Paper Template
Microsoft Word Lined Paper TemplateMicrosoft Word Lined Paper Template
Microsoft Word Lined Paper Template
Nathan Mathis
 
The Essay Writing Process Essays
The Essay Writing Process EssaysThe Essay Writing Process Essays
The Essay Writing Process Essays
Nathan Mathis
 
How To Make A Cover Page For Assignment Guide - As
How To Make A Cover Page For Assignment Guide - AsHow To Make A Cover Page For Assignment Guide - As
How To Make A Cover Page For Assignment Guide - As
Nathan Mathis
 
Awesome Creative Writing Essays Thatsnotus
Awesome Creative Writing Essays ThatsnotusAwesome Creative Writing Essays Thatsnotus
Awesome Creative Writing Essays Thatsnotus
Nathan Mathis
 
Sites That Write Papers For You. Websites That Write Essays For You
Sites That Write Papers For You. Websites That Write Essays For YouSites That Write Papers For You. Websites That Write Essays For You
Sites That Write Papers For You. Websites That Write Essays For You
Nathan Mathis
 
4.4 How To Organize And Arrange - Hu
4.4 How To Organize And Arrange - Hu4.4 How To Organize And Arrange - Hu
4.4 How To Organize And Arrange - Hu
Nathan Mathis
 
Essay Written In First Person
Essay Written In First PersonEssay Written In First Person
Essay Written In First Person
Nathan Mathis
 
My Purpose In Life Free Essay Example
My Purpose In Life Free Essay ExampleMy Purpose In Life Free Essay Example
My Purpose In Life Free Essay Example
Nathan Mathis
 
The Structure Of An Outline For A Research Paper, Including Text
The Structure Of An Outline For A Research Paper, Including TextThe Structure Of An Outline For A Research Paper, Including Text
The Structure Of An Outline For A Research Paper, Including Text
Nathan Mathis
 
What Are Some Topics For Exemplification Essays - Quora
What Are Some Topics For Exemplification Essays - QuoraWhat Are Some Topics For Exemplification Essays - Quora
What Are Some Topics For Exemplification Essays - Quora
Nathan Mathis
 
Please Comment, Like, Or Re-Pin For Later Bibliogra
Please Comment, Like, Or Re-Pin For Later BibliograPlease Comment, Like, Or Re-Pin For Later Bibliogra
Please Comment, Like, Or Re-Pin For Later Bibliogra
Nathan Mathis
 
Ide Populer Word In English, Top
Ide Populer Word In English, TopIde Populer Word In English, Top
Ide Populer Word In English, Top
Nathan Mathis
 
Ad

Recently uploaded (20)

How to Manage Cross Selling in Odoo 18 Sales
How to Manage Cross Selling in Odoo 18 SalesHow to Manage Cross Selling in Odoo 18 Sales
How to Manage Cross Selling in Odoo 18 Sales
Celine George
 
How to Configure Extra Steps During Checkout in Odoo 18 Website
How to Configure Extra Steps During Checkout in Odoo 18 WebsiteHow to Configure Extra Steps During Checkout in Odoo 18 Website
How to Configure Extra Steps During Checkout in Odoo 18 Website
Celine George
 
GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 4 MARCH 2025 .pdf
GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 4 MARCH 2025 .pdfGENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 4 MARCH 2025 .pdf
GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 4 MARCH 2025 .pdf
Quiz Club of PSG College of Arts & Science
 
MICROBIAL GENETICS -tranformation and tranduction.pdf
MICROBIAL GENETICS -tranformation and tranduction.pdfMICROBIAL GENETICS -tranformation and tranduction.pdf
MICROBIAL GENETICS -tranformation and tranduction.pdf
DHARMENDRA SAHU
 
YSPH VMOC Special Report - Measles Outbreak Southwest US 5-17-2025 .pptx
YSPH VMOC Special Report - Measles Outbreak  Southwest US 5-17-2025  .pptxYSPH VMOC Special Report - Measles Outbreak  Southwest US 5-17-2025  .pptx
YSPH VMOC Special Report - Measles Outbreak Southwest US 5-17-2025 .pptx
Yale School of Public Health - The Virtual Medical Operations Center (VMOC)
 
libbys peer assesment.docx..............
libbys peer assesment.docx..............libbys peer assesment.docx..............
libbys peer assesment.docx..............
19lburrell
 
Botany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic ExcellenceBotany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic Excellence
online college homework help
 
Conditions for Boltzmann Law – Biophysics Lecture Slide
Conditions for Boltzmann Law – Biophysics Lecture SlideConditions for Boltzmann Law – Biophysics Lecture Slide
Conditions for Boltzmann Law – Biophysics Lecture Slide
PKLI-Institute of Nursing and Allied Health Sciences Lahore , Pakistan.
 
2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx
mansk2
 
IPL QUIZ | THE QUIZ CLUB OF PSGCAS | 2025.pdf
IPL QUIZ | THE QUIZ CLUB OF PSGCAS | 2025.pdfIPL QUIZ | THE QUIZ CLUB OF PSGCAS | 2025.pdf
IPL QUIZ | THE QUIZ CLUB OF PSGCAS | 2025.pdf
Quiz Club of PSG College of Arts & Science
 
ITI COPA Question Paper PDF 2017 Theory MCQ
ITI COPA Question Paper PDF 2017 Theory MCQITI COPA Question Paper PDF 2017 Theory MCQ
ITI COPA Question Paper PDF 2017 Theory MCQ
SONU HEETSON
 
Module 1: Foundations of Research
Module 1: Foundations of ResearchModule 1: Foundations of Research
Module 1: Foundations of Research
drroxannekemp
 
Module_2_Types_and_Approaches_of_Research (2).pptx
Module_2_Types_and_Approaches_of_Research (2).pptxModule_2_Types_and_Approaches_of_Research (2).pptx
Module_2_Types_and_Approaches_of_Research (2).pptx
drroxannekemp
 
PUBH1000 Slides - Module 11: Governance for Health
PUBH1000 Slides - Module 11: Governance for HealthPUBH1000 Slides - Module 11: Governance for Health
PUBH1000 Slides - Module 11: Governance for Health
JonathanHallett4
 
Cyber security COPA ITI MCQ Top Questions
Cyber security COPA ITI MCQ Top QuestionsCyber security COPA ITI MCQ Top Questions
Cyber security COPA ITI MCQ Top Questions
SONU HEETSON
 
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit..."Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
AlionaBujoreanu
 
114P_English.pdf114P_English.pdf114P_English.pdf
114P_English.pdf114P_English.pdf114P_English.pdf114P_English.pdf114P_English.pdf114P_English.pdf
114P_English.pdf114P_English.pdf114P_English.pdf
paulinelee52
 
Peer Assesment- Libby.docx..............
Peer Assesment- Libby.docx..............Peer Assesment- Libby.docx..............
Peer Assesment- Libby.docx..............
19lburrell
 
materi 3D Augmented Reality dengan assemblr
materi 3D Augmented Reality dengan assemblrmateri 3D Augmented Reality dengan assemblr
materi 3D Augmented Reality dengan assemblr
fatikhatunnajikhah1
 
Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...
Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...
Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...
parmarjuli1412
 
How to Manage Cross Selling in Odoo 18 Sales
How to Manage Cross Selling in Odoo 18 SalesHow to Manage Cross Selling in Odoo 18 Sales
How to Manage Cross Selling in Odoo 18 Sales
Celine George
 
How to Configure Extra Steps During Checkout in Odoo 18 Website
How to Configure Extra Steps During Checkout in Odoo 18 WebsiteHow to Configure Extra Steps During Checkout in Odoo 18 Website
How to Configure Extra Steps During Checkout in Odoo 18 Website
Celine George
 
MICROBIAL GENETICS -tranformation and tranduction.pdf
MICROBIAL GENETICS -tranformation and tranduction.pdfMICROBIAL GENETICS -tranformation and tranduction.pdf
MICROBIAL GENETICS -tranformation and tranduction.pdf
DHARMENDRA SAHU
 
libbys peer assesment.docx..............
libbys peer assesment.docx..............libbys peer assesment.docx..............
libbys peer assesment.docx..............
19lburrell
 
Botany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic ExcellenceBotany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic Excellence
online college homework help
 
2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx
mansk2
 
ITI COPA Question Paper PDF 2017 Theory MCQ
ITI COPA Question Paper PDF 2017 Theory MCQITI COPA Question Paper PDF 2017 Theory MCQ
ITI COPA Question Paper PDF 2017 Theory MCQ
SONU HEETSON
 
Module 1: Foundations of Research
Module 1: Foundations of ResearchModule 1: Foundations of Research
Module 1: Foundations of Research
drroxannekemp
 
Module_2_Types_and_Approaches_of_Research (2).pptx
Module_2_Types_and_Approaches_of_Research (2).pptxModule_2_Types_and_Approaches_of_Research (2).pptx
Module_2_Types_and_Approaches_of_Research (2).pptx
drroxannekemp
 
PUBH1000 Slides - Module 11: Governance for Health
PUBH1000 Slides - Module 11: Governance for HealthPUBH1000 Slides - Module 11: Governance for Health
PUBH1000 Slides - Module 11: Governance for Health
JonathanHallett4
 
Cyber security COPA ITI MCQ Top Questions
Cyber security COPA ITI MCQ Top QuestionsCyber security COPA ITI MCQ Top Questions
Cyber security COPA ITI MCQ Top Questions
SONU HEETSON
 
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit..."Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
AlionaBujoreanu
 
114P_English.pdf114P_English.pdf114P_English.pdf
114P_English.pdf114P_English.pdf114P_English.pdf114P_English.pdf114P_English.pdf114P_English.pdf
114P_English.pdf114P_English.pdf114P_English.pdf
paulinelee52
 
Peer Assesment- Libby.docx..............
Peer Assesment- Libby.docx..............Peer Assesment- Libby.docx..............
Peer Assesment- Libby.docx..............
19lburrell
 
materi 3D Augmented Reality dengan assemblr
materi 3D Augmented Reality dengan assemblrmateri 3D Augmented Reality dengan assemblr
materi 3D Augmented Reality dengan assemblr
fatikhatunnajikhah1
 
Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...
Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...
Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...
parmarjuli1412
 
Ad

ATTENTION BASED IMAGE CAPTIONING USING DEEP LEARNING

  • 1. International Journal of Innovative Research in Advanced Engineering (IJIRAE) ISSN: 2349-2163 Issue 12, Volume 8 (December 2021) https:/ / www.ijirae.com/ archives ____________________________________________________________________________________________________________________________________ IJIRAE:: © 2014-22, AM Publications, India - All Rights Reserved Page -379 ATTENTION BASED IMAGE CAPTIONINGUSINGDEEP LEARNING Tejaswini Nakirekanti* Dept of CSE, Mahatma Gandhi Institute of Technology, Hyderabad-500075, Telangana, India tejaswininakirekanti@gmail.com Deepika D Assistant Professor, Dept of CSE, Mahatma Gandhi Institute of Technology, Hyderabad-500075, Telangana, India ddeepika_cse@mgit.ac.in Publication History Manuscript Reference No: IJIRAE/ RS/ Vol.08/ Issue12/ DCAE10087 Research Article | Open Access Peer-review: Double-blind Peer-reviewed Article ID: IJIRAE/ RS/ Vol.08/ Issue12/ DCAE10087 Received: 20, December 2021 Accepted: 29, December 2021 Published Online: 30, December 2021 Volume 2021 | Article ID DCAE10087 http:/ / www.ijirae.com/ volumes/ Vol8/ iss-12/ 09.DCAE10087.pdf Citation: Tejaswini,Deepika (2021). Attention Based image Captioning using Deep Learning. International Journal of Innovative Research in Advanced Engineering, VIII, 379-387 doi: https:/ / doi.org/ 10.26562/ ijirae.2021.v0812.009 Editor-Chief: Dr.A.Arul Lawrence Selvakumar, Chief Editor, IJIRAE, AM Publications, India Copyright: © 2021 This is an open access article distributed under the terms of the Creative Commons Attribution License; Which Permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited Abstract: The process of deriving written descriptions from a picture based on the items and behaviours portrayed is known as image caption generation. The resulting description will represent what is in the image and the interactions among the objects. However, like with any other image processing challenge, recreating this characteristic in an artificial system is a difficult feat, and so the adoption Deep Learning procedures can help to tackle the problem. The main goal of this project is to generate an image captioning model that can provide a concise and relatable description of the picture by incorporating an attention mechanism. Ideally, the model can be separated into two components – one is called Encoder and the other is a Decoder. The encoder used here is a Google InceptionV3 pre-trained model. And the decoder is a language-based model, Gated Recurrent Unit (GRU) to translate the characteristics and objects provided by the encoder into natural sentences. Furthermore, the model's effectiveness is aided by the attention mechanism. To conclude, the model is trained on a subset of the MS COCO dataset, which contains about 80,000 pictures with at least five descriptions each and evaluated based on the BLEU score. To reduce the loss, the model weights are tuned using the categorical cross-entropy loss function. The results of the model are promising and competitive. Keywords: CNN-RNN Architecture, Inception V3, Attention-mechanism, Gated recurrent unit (GRU), BLEU-metric. I. INTRODUCTION A computer can't tell what is there in a picture, unlike humans. We human beings can able to differentiate actions in the world based on our knowledge. For a computer to analyse and differentiate the objects and actions in an image is hard but not impossible. 
Deep learning advancements have led to the development of a model which can generate captions for an image with the help of a large collection of data and computation power. This helps to give some meaning to huge data, which also creates an impact in future research. Many applications can benefit from a model that provides descriptions for images automatically. Enhancing the accuracy of search engines, self-driving automobiles, identification, and vision applications are just a few examples. Image captioning began to rely on deep learning for its recent advances. Usually, the encoder-decoder framework is employed, where CNN is used as an encoder to extract information from images. RNN is employed as a decoder to convert this representation into a natural language description. A CNN, on the other hand, must condense all of the input data and then sent to the decoder. It is possible that compressing input sequences will result in data loss. The CNN - RNN image captioning architecture includes an attention to focus on one part of the image while generating a single word. The visual attention aids in reducing the impact of irrelevant words while generating a single word at a point of time that is appropriate to the current context.
  • 2. International Journal of Innovative Research in Advanced Engineering (IJIRAE) ISSN: 2349-2163 Issue 12, Volume 8 (December 2021) https:/ / www.ijirae.com/ archives ______________________________________________________________________________________________________________________________________ IJIRAE:: © 2014-20, AM Publications, India - All Rights Reserved Page -380 A. Problem Definition Human participation is usually required to annotate images, making it a near-impossible process for large commercial databases. Deep learning can aid in the dynamic annotation of these photos. Image captioning aims to create a logical and holistic explanation of an image's content. The project's purpose is to create an image captioning model that uses attention to construct a description from an input image. B. Existing System From retrieval-based and template-based methods, image captioning has grown rapidly in recent years. Retrieval based methods tries to extract similar-captioned images to the query image by leveraging distance metric and combine all the retrieved captions together to generate new caption. In the future, this approach may not be able to produce new descriptions. In case of template based methods, it maintains a collection of sentence templates which work according to the scenario in the image and tries to fill the template by depicting particular objects or actions present in the target image. The main drawback of this technique is it cannot able to generate variable length captions even though they are grammatically correct. C. Proposed System The proposed model employs a CNN to encode a picture which contains the nuances of the image. This is subsequently utilized to build a context vector by the attention layer. Later the RNN decodes the information produced by the attention model into a sequential, comprehensible statement. II. LITERATURE SURVEY Potential research work carried out on various techniques for the image caption generation using different techniques and has been discussed in the following paragraphs. Various researchers have employed different mechanisms for captioning the image and to find out the most useful features. (Genc Hoxha et. al.) Proposed an image captioning framework that combines generation and retrieval-based image captioning in this research. Multiple descriptions are generated by combining CNN-RNN architecture with beam- searching. Instead of generating a single visual description, beam search assists in the generation of many descriptions. The experiments are performed on RSCID dataset. The use of more advanced lexical similarity can improve the lexical similarity between the generated and reference descriptions. [1] (Ansar Hani et. al.) Put forth an encoder-decoder structure for an attention-based picture captioning model. The attention mechanism helps you focus on certain parts of a visual rather than the complete picture. The encoder is a pre-trained Google InceptionV3 model, and the decoder is a Gated Recurrent Unit (GRU). The encoder uses CNN to extract the image's features, while the decoder derives relevant descriptions. The model was tested by MSCOCO dataset, and the outcomes were provided. The model's is assessed using Adam's adaptive gradient optimizer. [2] (Soheyla Amirian et. al.) Created a GAN network, which comprises a Descriptor and a Generator, in this study. The Variational Autoencoder (VAE) is employed which helps in mapping the input to a distribution. 
Auto-encoders are neural networks that learn data encodings in an unsupervised manner. The model added sentiment and diversity to the captions to generate more human-like descriptions rather than predefined templates. The model is evaluated on the MSCOCO dataset. [3]
(Niange Yu et al.) introduced an image captioning architecture that constructs descriptions based on a given topic. A multi-label classifier selects the topics in a given image. The model takes an image-topic pair as input and yields a natural sentence for the image as output. [4]
(Fang Fang et al.) developed a unique word-level attention layer that processes visual attributes using two modules for reliable word prediction. The first module is a bidirectional spatial-embedding module for handling region proposals, and the second extracts word-level attention, which is subsequently fed into the language model via an attention approach. A recurrent neural network termed Long Short-Term Memory (LSTM) is used as the language model, and the nuances of a picture are extracted using a convolutional neural network (CNN). [5]
III. DESIGN METHODOLOGY
The purpose of image caption extraction is to provide a detailed and comprehensive description of the image's content. When extracting the sentence, the model evaluates not just the objects in the image but also their relationships in order to create sequences. An attention-oriented CNN-RNN framework is used to accomplish this. The following is a general outline of the work involved in creating captions for images:
• Model Architecture
• Data Collection and Preprocessing
• Model Building
A. Model Architecture
Fig. 1 depicts the complete system framework. The dataset used for the system is the MSCOCO dataset, which contains images along with five descriptions per picture [6]. Pre-processing is then performed to generate a vocabulary set from the dataset's captions and to resize the images. The encoder-decoder framework, together with the attention mechanism, makes up the model building. The goal of the attention model is to boost the model's efficiency.
The Google InceptionV3 model is used as an encoder to extract the nuances from an image, and an RNN Gated Recurrent Unit (GRU) is used to generate meaningful sequences. The model's efficiency is evaluated using the BLEU metric.
Fig. 1 Flowchart of model architecture.
B. Data Collection and Preprocessing
Although there are various resources, the MSCOCO dataset, which is very popular and known for its large-scale object detection tasks, is chosen to train the model. This dataset comprises a collection of photos of daily activities along with their captions. It contains over 80,000 photos, each of which is accompanied by at least five user-generated phrases describing the scenario depicted in the photograph. Fig. 2 shows a picture from the dataset together with its five reference descriptions.
Fig. 2 A snapshot of an image from the dataset.
1) Data Preparation and Preprocessing: The dataset has to be converted into a layout suitable for training the model. As the dataset contains different kinds of images and descriptions, preparation and pre-processing are needed before the data is fed into the model.
2) Preprocessing Images: The dataset contains images of different dimensions; the images have to be scaled so that they are suitable for the model's encoder. The encoder of this project accepts images of shape 299x299x3, so each image must be scaled to this shape before being fed to the model.
3) Preprocessing Descriptions: The descriptions in the dataset are English sentences. From these sentences, a vocabulary has to be created based on word frequency. This is done with pre-processing techniques in which the sentences are split into words and punctuation is eliminated; the top 5,000 most frequent words are then extracted and form the vocabulary. Machine learning models only accept vectors (arrays of numbers) as input, so the textual input, which is nothing but strings, has to be transformed into numbers (the text must be "vectorized") before being fed to the model. To do this, word_index and index_word dictionaries are maintained, as shown in Fig. 3 and Fig. 4, where each word is associated with a unique number and each number with a unique word [8]. These dictionaries are used to convert sentences into sequences of numbers and back, and are used while training, testing, evaluating, and predicting.
4) Divide data into train and test sets: The purpose of dividing the data is straightforward: to prevent the model from overfitting and to evaluate it appropriately, the data must be separated into train, validation, and test splits. A sketch of the full preprocessing pipeline follows Fig. 4.
Fig. 3 Snapshot of the word_index dictionary.
Fig. 4 Snapshot of the index_word dictionary.
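The following is a minimal sketch of the preprocessing steps described above, written with TensorFlow/Keras. The parallel lists captions and image_paths, the split ratio, and the variable names are illustrative assumptions, not the exact code of this project.

    import tensorflow as tf
    from sklearn.model_selection import train_test_split

    # --- Text preprocessing: build a 5,000-word vocabulary from the captions ---
    # `captions` is assumed to be a list of strings such as
    # "<start> a man riding a horse <end>".
    tokenizer = tf.keras.preprocessing.text.Tokenizer(
        num_words=5000,                                # keep the 5,000 most frequent words
        oov_token="<unk>",                             # placeholder for out-of-vocabulary words
        filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')     # strip punctuation (keeps < and >)
    tokenizer.fit_on_texts(captions)
    word_index = tokenizer.word_index                  # word    -> unique integer
    index_word = tokenizer.index_word                  # integer -> word

    # Vectorize the captions and pad them to a common length for batching.
    sequences = tokenizer.texts_to_sequences(captions)
    cap_vector = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding="post")

    # --- Image preprocessing: resize to the 299x299x3 shape InceptionV3 expects ---
    def load_image(path):
        img = tf.io.read_file(path)
        img = tf.image.decode_jpeg(img, channels=3)
        img = tf.image.resize(img, (299, 299))
        img = tf.keras.applications.inception_v3.preprocess_input(img)  # pixels to [-1, 1]
        return img

    # --- Split the image/caption pairs into train and test sets ---
    img_train, img_test, cap_train, cap_test = train_test_split(
        image_paths, cap_vector, test_size=0.2, random_state=0)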
C. Model Building
An attention-based CNN-RNN architecture is employed to create the framework. The proposed model's architecture is depicted in Fig. 5.
Fig. 5 The model architecture.
1) Encoder: The model's encoder is a pre-trained convolutional neural network, Google InceptionV3, which produces a representation of the input image. This model placed second in the ILSVRC2015 image classification competition [9]. The last layer, which is used for classification, was removed, and the characteristics are extracted from the pre-processed images using the InceptionV3 model. Fig. 6 shows the 42-layer structural design of the InceptionV3 model. Inception-v3 introduced a novel inception module that combines several convolutional filters of different sizes into a single filter. This design reduces the number of parameters that must be trained, and with it the computational complexity. The encoder accepts images of shape 299x299x3 and gives an output of shape 8x8x2048; in the original network, a fully connected layer at the end converts this into a one-dimensional array of size 1001 used for classification. The attention mechanism works by focusing on certain parts of the image while generating sequences, rather than considering the full image, and the fully connected representation cannot support this because, being one-dimensional, it does not capture the exact locations within the image. A minimal sketch of the resulting feature extractor is given below.
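As a rough illustration of the encoder just described, the sketch below builds InceptionV3 without its classification head so that the spatial 8x8x2048 activation map is exposed; the variable names are illustrative.

    import tensorflow as tf

    # Load InceptionV3 pre-trained on ImageNet and drop the classification layers.
    # include_top=False keeps the last convolutional output instead of the
    # one-dimensional fully connected output used for classification.
    base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")

    # Wrap it as a feature extractor: input 299x299x3 -> output 8x8x2048.
    encoder = tf.keras.Model(inputs=base.input, outputs=base.output)

    features = encoder(tf.zeros((1, 299, 299, 3)))   # sanity check on the shape
    print(features.shape)                            # (1, 8, 8, 2048)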
Fig. 6 Architecture of the InceptionV3 model.
To achieve the attention mechanism, the last layer of the architecture is removed and the hidden layer with an output shape of 8x8x2048 is taken as the output layer. This output retains the location information of the image, making it easy to focus on certain regions while generating sequences [10]. Consider a CNN network, as shown in Fig. 7, where the input is an image and the output is a one-dimensional vector of size 1000; such a vector does not capture locations. In contrast, in Fig. 8 the output is taken from a convolutional layer, which captures the locations in the original image.
Fig. 7 A convolutional layer with 1-dimensional output.
Fig. 8 A convolutional layer with 3-dimensional output capturing the locations of the image.
2) Attention Mechanism: When creating descriptions for images, local attention concentrates on particular areas of the given image. The decoder pays attention to specific parts of the image, conditioned on the hidden state at time t, and constructs a context vector from the nuances of the image provided by the CNN's convolutional layer. Giving the entire image as input to the RNN at every step is not always practical and may result in inappropriate captions. Therefore, instead of the CNN's fully connected layer, the output of one of the convolutional layers, which contains spatial information about the image, is used as input to the attention model: the encoder output of shape 8x8x2048 is reshaped to 64x2048 and given as input. The attention mechanism automatically reduces the weights of insignificant features to eliminate the influence of non-relevant terms. These attention weights are given as input to the language model, which helps in generating sequences; as a result, the weights direct the model to generate relevant phrases that focus on the image's objects. Fig. 9 depicts how the attention mechanism contributes to the generation of captions from a given image: the model analyses and highlights only the significant aspects of the image. First the region containing the man is taken into consideration, then the region around it containing the snow mountain; the caption generated here would be "A person riding skis on a mountain". A sketch of such an attention layer follows the figure.
Fig. 9 Working of the attention mechanism.
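The following is a minimal sketch of a Bahdanau-style (additive) attention layer of the kind described above, assuming the 64x2048 reshaped encoder features and a GRU hidden state; the layer sizes are illustrative assumptions.

    import tensorflow as tf

    class BahdanauAttention(tf.keras.Model):
        """Additive attention: scores each of the 64 image regions against the
        decoder's current hidden state and returns a weighted context vector."""
        def __init__(self, units):
            super().__init__()
            self.W1 = tf.keras.layers.Dense(units)   # projects the image features
            self.W2 = tf.keras.layers.Dense(units)   # projects the decoder hidden state
            self.V = tf.keras.layers.Dense(1)        # collapses to one score per region

        def call(self, features, hidden):
            # features: (batch, 64, 2048) reshaped encoder output
            # hidden:   (batch, units) decoder state at time t
            hidden_with_time = tf.expand_dims(hidden, 1)     # (batch, 1, units)
            scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
            attention_weights = tf.nn.softmax(scores, axis=1)          # (batch, 64, 1)
            context_vector = tf.reduce_sum(attention_weights * features, axis=1)
            return context_vector, attention_weights  # (batch, 2048), weights for plots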
3) Decoder: The decoder is a language model that converts the nuances extracted from an image into textual descriptions. A GRU is employed because of its structure and because it does not suffer from the vanishing gradient problem. The attention layer produces a one-dimensional context vector, which is fed to the GRU. The GRU is made up of two gates: an update gate and a reset gate [11]. The purpose of the update gate is to tell the model how much of the previous information has to be kept, i.e., passed on to the future, while the reset gate determines how much of the past information should be forgotten. At each step, the weights are adjusted according to these gates. Fig. 10 depicts the architecture of the GRU, which is used to generate sequences based on experience; a minimal sketch of such a decoder follows the figure.
Fig. 10 Architecture of the Gated Recurrent Unit (GRU).
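Below is a minimal sketch of a GRU decoder that consumes the attention context vector together with the previous word's embedding, reusing the BahdanauAttention layer sketched earlier; the embedding and hidden sizes are illustrative assumptions.

    import tensorflow as tf

    class Decoder(tf.keras.Model):
        """One decoding step: embed the previous word, attend over the image
        features, and let the GRU predict a distribution over the vocabulary."""
        def __init__(self, vocab_size, embedding_dim=256, units=512):
            super().__init__()
            self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
            self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
            self.fc1 = tf.keras.layers.Dense(units)
            self.fc2 = tf.keras.layers.Dense(vocab_size)  # logits over the 5,000-word vocabulary
            self.attention = BahdanauAttention(units)     # the layer sketched above

        def call(self, word_id, features, hidden):
            # word_id: (batch, 1) previous word; features: (batch, 64, 2048)
            context_vector, attention_weights = self.attention(features, hidden)
            x = self.embedding(word_id)                               # (batch, 1, embedding_dim)
            x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
            output, state = self.gru(x)                   # update/reset gates act inside the GRU
            x = self.fc2(self.fc1(tf.reshape(output, (-1, output.shape[2]))))
            return x, state, attention_weights            # logits, new hidden state, weights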
IV. TESTING AND RESULTS
TensorFlow 2.0 is used to create the model. The supplied photos are resized to 299 by 299 pixels, and the last convolutional layer's 8x8x2048 activation maps are used as annotations. The categorical cross-entropy loss function is used to tune the model weights and reduce the loss; the loss plot is shown in Fig. 11.
Fig. 11 Loss plot.
A. Model Evaluation
The BLEU metric [7] is used to evaluate the model. The Bilingual Evaluation Understudy, or BLEU, is a metric for comparing a translated text to one or more reference translations. Although it was designed for machine translation, it can also be used to compare the text generated by the model to the reference (real) captions. The score lies on a scale of 0 to 1; when only single-word (unigram) matches are counted, the comparison does not depend on word order. Fig. 12 shows a predicted caption and its BLEU score for an image from the test set; an illustrative example of computing the score follows the figure.
Fig. 12 A picture and its result from the test set.
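As an illustration, BLEU can be computed with NLTK's sentence_bleu; the captions below are made-up examples, not outputs of this model.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Reference captions (tokenized) for one test image -- illustrative only.
    references = [
        "a person riding skis on a snowy mountain".split(),
        "a man skiing down a snow covered slope".split(),
    ]
    candidate = "a person riding skis on a mountain".split()

    # Unigram BLEU with smoothing, on the 0-to-1 scale discussed above.
    score = sentence_bleu(references, candidate,
                          weights=(1.0, 0, 0, 0),
                          smoothing_function=SmoothingFunction().method1)
    print(round(score, 3))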
B. Results
Some results of the model are given below; the model captions the given image based on the items in the picture and the interactions among them. Fig. 13 shows a prediction from the test set, with the real caption, predicted caption, BLEU score, and attention plot for the given image; the generated caption is appropriate to the content of the image.
Fig. 13 Result of a picture from the test set.
As the model is trained on only 10,000 images from the dataset, it may generate inappropriate captions, as shown in Fig. 14.
Fig. 14 Result of a picture from the test set.
The model was also run on pictures outside the train and test sets. Fig. 15 shows an appropriate caption generated for a given image, while Fig. 16 shows a predicted caption that is quite inappropriate.
Fig. 15 Appropriate caption for the given image.
Fig. 16 Inappropriate caption for the given image.
V. CONCLUSION AND FUTURE WORK
A. Conclusion
An image captioning model that integrates attention is described to construct a natural sentence from a given image. The characteristics of an image are extracted by a convolutional neural network and then given to the attention layer to generate a context vector; this is decoded by a recurrent neural network into understandable sequences. The attention mechanism aids in the optimization of results as well as the creation of appropriate captions. The model is trained on a subset of the MS COCO dataset, and the number of training samples must be increased to improve its performance; the model can also be trained on a variety of datasets. The model is assessed using the BLEU metric, which considers translation length, word choice, and word order to judge how similar the generated caption is to the human reference. When compared to previous research, the outcomes achieved are promising and competitive.
B. Future Work
For future work, implementing a more compact deep convolutional network may help to enhance the efficiency of the proposed model. The current model generates only one caption, but this can be extended with beam search, where the model generates more than one candidate caption per image; a sketch of the beam-search idea is given below. The proposed model could also be updated to include video captioning, in which a video's information is summarized in sequences.
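As a rough sketch of the beam-search idea mentioned above (future work, not part of the implemented model): at each step the k best partial captions are kept and extended. The decoder_step helper, which is assumed to return log-probabilities over the vocabulary for the next word, is hypothetical.

    def beam_search(decoder_step, start_id, end_id, k=3, max_len=20):
        """Keep the k highest-scoring partial captions at each step.
        decoder_step(sequence) is a hypothetical helper returning a list of
        log-probabilities over the vocabulary for the next word."""
        beams = [([start_id], 0.0)]                  # (token ids, cumulative log-prob)
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq[-1] == end_id:                # finished captions pass through unchanged
                    candidates.append((seq, score))
                    continue
                log_probs = decoder_step(seq)
                for word_id, lp in enumerate(log_probs):
                    candidates.append((seq + [word_id], score + lp))
            # Keep only the k best partial captions.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        return beams                                 # k candidate captions with scores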
REFERENCES
1. Genc Hoxha, Farid Melgani, Jacopo Slaghenauffi. (2020). A New CNN-RNN Framework for Remote Sensing Image Captioning. 2020 Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS).
2. Ansar Hani, Najiba Tagougui, Monji Kherallah. (2019). Image Caption Generation Using a Deep Architecture. 2019 International Arab Conference on Information Technology (ACIT).
3. Soheyla Amirian, Khaled Rasheed, Thiab R. Taha, Hamid R. Arabnia. (2019). Image Captioning with Generative Adversarial Network. 2019 International Conference on Computational Science and Computational Intelligence (CSCI).
4. Niange Yu, Xiaolin Hu, Binheng Song, Jian Yang, Jianwei Zhang. (2018). Topic-Oriented Image Captioning Based on Order-Embedding. IEEE Transactions on Image Processing, Volume 28, Issue 6, June 2019.
5. Fang Fang, Hanli Wang, Pengjie Tang. (2018). Image Captioning with Word Level Attention. 2018 25th IEEE International Conference on Image Processing (ICIP).
6. MS COCO dataset. Available: https://cocodataset.org/#home.
7. Jason Brownlee. A Gentle Introduction to Calculating the BLEU Score for Text in Python. Deep Learning for Natural Language Processing, November 20, 2017.
8. Rishi Sidhu. Turn Words into Numbers - The NLP Way. September 1, 2020.
9. Sik-Ho Tsang. Inception-v3 - 1st Runner Up (Image Classification) in ILSVRC 2015. September 10, 2018.
10. NPTEL-NOC IITM. "Deep Learning (CS7015): Lec 15.4 Attention over Images," YouTube, October 24, 2018 [Video file]. Available: https://www.youtube.com/watch?v=hvhqHhrP_AU&list=PLyqSpQzTE6M9gCgajvQbc68Hk_JKGBAYT&index=119. [Accessed: April 20, 2021].
11. Simeon Kostadinov. Understanding GRU Networks. December 16, 2017.