International Journal of Innovative Research in Advanced Engineering (IJIRAE) ISSN: 2349-2163
Issue 12, Volume 8 (December 2021) https://www.ijirae.com/archives
ATTENTION BASED IMAGE CAPTIONING USING DEEP LEARNING
Tejaswini Nakirekanti*
Dept of CSE, Mahatma Gandhi Institute of Technology,
Hyderabad-500075, Telangana, India
tejaswininakirekanti@gmail.com
Deepika D
Assistant Professor, Dept of CSE, Mahatma Gandhi Institute of Technology,
Hyderabad-500075, Telangana, India
ddeepika_cse@mgit.ac.in
Publication History
Manuscript Reference No: IJIRAE/RS/Vol.08/Issue12/DCAE10087
Research Article | Open Access
Peer-review: Double-blind Peer-reviewed
Article ID: IJIRAE/RS/Vol.08/Issue12/DCAE10087
Received: 20 December 2021
Accepted: 29 December 2021
Published Online: 30 December 2021
Volume 2021 | Article ID DCAE10087
http://www.ijirae.com/volumes/Vol8/iss-12/09.DCAE10087.pdf
Citation: Tejaswini, N., & Deepika, D. (2021). Attention Based Image Captioning using Deep Learning. International Journal of Innovative Research in Advanced Engineering, VIII, 379-387.
doi: https://doi.org/10.26562/ijirae.2021.v0812.009
Editor-in-Chief: Dr. A. Arul Lawrence Selvakumar, Chief Editor, IJIRAE, AM Publications, India
Copyright: © 2021. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract: Image caption generation is the task of deriving a written description of a picture from the objects and actions it portrays. The resulting description should represent both what is in the image and the interactions among the objects. As with other image processing challenges, recreating this ability in an artificial system is difficult, so Deep Learning techniques are adopted to tackle the problem. The main goal of this project is to build an image captioning model that provides a concise and relatable description of a picture by incorporating an attention mechanism. The model can be separated into two components, an encoder and a decoder. The encoder used here is a pre-trained Google InceptionV3 model, and the decoder is a language model, a Gated Recurrent Unit (GRU), that translates the features and objects provided by the encoder into natural sentences. The attention mechanism further improves the model's effectiveness. The model is trained on a subset of the MS COCO dataset, which contains about 80,000 pictures with at least five descriptions each, and is evaluated with the BLEU score. The model weights are tuned with the categorical cross-entropy loss function to reduce the loss. The results of the model are promising and competitive.
Keywords: CNN-RNN Architecture, Inception V3, Attention-mechanism, Gated recurrent unit (GRU), BLEU-metric.
I. INTRODUCTION
Unlike humans, a computer cannot tell what is in a picture. We human beings are able to differentiate actions in the world based on our knowledge. For a computer, analysing and differentiating the objects and actions in an image is hard but not impossible. Advances in deep learning have led to models that can generate captions for an image, given a large collection of data and enough computation power. This helps give meaning to huge amounts of data and also shapes future research. Many applications can benefit from a model that describes images automatically: enhancing the accuracy of search engines, self-driving automobiles, identification, and other vision applications are just a few examples.
Recent advances in image captioning rely on deep learning. Usually an encoder-decoder framework is employed, where a CNN serves as the encoder that extracts information from the image and an RNN serves as the decoder that converts this representation into a natural language description. The CNN, however, must condense all of the input information into one representation before it is sent to the decoder, and compressing the input this way can lose information. The CNN-RNN image captioning architecture therefore includes an attention mechanism that focuses on one part of the image while generating each word. Visual attention reduces the impact of irrelevant regions when generating the single word that is appropriate to the current context.
A. Problem Definition
Human participation is usually required to annotate images, making manual annotation a near-impossible process for large commercial databases. Deep learning can aid in annotating such images automatically. Image captioning aims to create a logical and holistic explanation of an image's content. The project's purpose is to create an image captioning model that uses attention to construct a description from an input image.
B. Existing System
Image captioning has grown rapidly in recent years beyond retrieval-based and template-based methods. Retrieval-based methods use a distance metric to find images whose captions are similar to the query image and combine the retrieved captions to generate a new caption; as a consequence, this approach cannot really produce novel descriptions. Template-based methods maintain a collection of sentence templates matching the scenario in the image and fill a template with the particular objects or actions present in the target image. The main drawback of this technique is that it cannot generate variable-length captions, even though the outputs are grammatically correct.
C. Proposed System
The proposed model employs a CNN to encode a picture into a representation that captures the nuances of the image. The attention layer subsequently uses this representation to build a context vector. The RNN then decodes the information produced by the attention model into a sequential, comprehensible statement.
II. LITERATURE SURVEY
Research on various techniques for image caption generation is discussed in the following paragraphs. Various researchers have employed different mechanisms for captioning images and for finding the most useful features.
Genc Hoxha et al. proposed an image captioning framework that combines generation-based and retrieval-based captioning. Multiple descriptions are generated by combining a CNN-RNN architecture with beam search: instead of generating a single description, beam search produces several candidates. The experiments are performed on the RSICD dataset. Using a more advanced lexical-similarity measure could further improve the similarity between the generated and reference descriptions. [1]
Ansar Hani et al. put forth an encoder-decoder structure for an attention-based picture captioning model. The attention mechanism focuses on certain parts of a visual rather than the complete picture. The encoder is a pre-trained Google InceptionV3 model, and the decoder is a Gated Recurrent Unit (GRU): the encoder uses the CNN to extract the image's features, while the decoder derives relevant descriptions. The model was tested on the MSCOCO dataset and the outcomes reported; it is optimized with the Adam adaptive gradient optimizer. [2]
Soheyla Amirian et al. created a GAN network comprising a Descriptor and a Generator. A Variational Autoencoder (VAE) is employed to map the input to a distribution; autoencoders are neural networks that learn data encodings without supervision. Rather than relying on predefined templates, the model adds sentiment and diversity to the captions to generate more human-like descriptions. The model is evaluated on the MSCOCO dataset. [3]
Niange Yu et al. introduced an image-captioning architecture that constructs descriptions based on a given topic. A multi-label classifier selects the topics in a given image, and the model takes an image-topic pair as input and yields a natural sentence for the image as output. [4]
Fang Fang et al. developed a word-level attention layer that processes visual attributes using two modules for reliable word prediction. The first is a bidirectional spatial-embedding module for handling region proposals; the second extracts word-level attention, which is then fed into the language model through an attention approach. The language model is a recurrent neural network, the Long Short-Term Memory (LSTM), and the visual features are extracted from the picture by a convolutional neural network (CNN). [5]
III. DESIGN METHODOLOGY
The purpose of image caption generation is to provide a detailed and comprehensive description of the image's content. When constructing the sentence, the model evaluates not just the objects in the image but also their relationships. An attention-oriented CNN-RNN framework is used to accomplish this. Creating captions for images involves three parts:
• Model Architecture
• Data Collection and Preprocessing
• Model Building
A. Model Architecture
Fig. 1 depicts the complete system framework. The dataset used is the MSCOCO dataset, which contains images along with five descriptions per picture [6]. Pre-processing then builds a vocabulary from the dataset captions and resizes the images. Model building comprises the encoder-decoder framework together with the attention mechanism, whose goal is to boost the model's efficiency.
The Google InceptionV3 model is used as the encoder to extract the nuances of an image, and an RNN, the Gated Recurrent Unit (GRU), generates meaningful sequences. Model quality is evaluated with the BLEU metric.
Fig. 1 Flowchart of model architecture.
B. Data Collection and Preprocessing
Although various resources exist, the MSCOCO dataset, which is very popular and known for its large-scale object detection tasks, is chosen to train the model. The dataset comprises photos of daily activities along with their captions: over 80,000 photos, each accompanied by at least five user-generated sentences describing the depicted scenario. Fig. 2 shows a picture from the dataset together with its five reference descriptions.
Fig. 2 A snapshot of an image from dataset.
1). Data Preparation and Preprocessing: The dataset has to be converted into a layout suitable for training the model. As the dataset contains many kinds of images and descriptions, both need preparation and pre-processing before being fed into the model.
2). Preprocessing Images: The dataset contains images of different dimensions, so every image has to be scaled to a shape the model encoder accepts. The encoder in this project accepts images of shape 299x299x3, and each image must be resized to that shape before being fed to the model.
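As an illustration, this resizing step might look like the following TensorFlow sketch (the helper name load_image is ours, not from the paper):

import tensorflow as tf

def load_image(image_path):
    """Read an image file, resize it to 299x299, and apply the
    InceptionV3-specific preprocessing (scales pixels to [-1, 1])."""
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path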
3). Preprocessing Descriptions: The descriptions in the dataset are English sentences, from which a vocabulary is built based on word frequency. Pre-processing splits the sentences into words and removes punctuation, and the 5,000 most frequent words are then kept as the vocabulary. Machine learning models only accept vectors (arrays of numbers) as input, so the textual input, which is just strings, has to be transformed into numbers (the text must be "vectorized") before being fed to the model. For this, word_index and index_word dictionaries are maintained, as shown in Fig. 3 and Fig. 4, where each word is associated with a unique number and each number with a unique word [8]. These dictionaries convert sentences to sequences of numbers and back, and the sequences are fed into the network during training, testing, evaluation, and prediction.
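A minimal sketch of this vocabulary step with the Keras tokenizer (train_captions is a hypothetical list holding the caption strings):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Keep the 5,000 most frequent words; rarer words map to an <unk> token.
tokenizer = Tokenizer(num_words=5000, oov_token='<unk>')
tokenizer.fit_on_texts(train_captions)
# word_index maps word -> id; index_word maps id -> word (Fig. 3 and Fig. 4).
word_index, index_word = tokenizer.word_index, tokenizer.index_word

# Vectorize: each caption becomes a padded sequence of integers.
sequences = tokenizer.texts_to_sequences(train_captions)
caption_vectors = pad_sequences(sequences, padding='post')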
4). Divide data into train and test sets: The purpose of dividing the data is straightforward: to prevent the model from overfitting and to evaluate it appropriately, the data must be separated into train, validation, and test splits.
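For example, with scikit-learn (the 80/20 ratio here is an assumption, as the paper does not state its exact split; a validation set can be carved out of the training portion the same way):

from sklearn.model_selection import train_test_split

# image_paths and caption_vectors are aligned one-to-one (one row per caption).
img_train, img_test, cap_train, cap_test = train_test_split(
    image_paths, caption_vectors, test_size=0.2, random_state=0)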
Fig. 3 Snapshot of the word_index dictionary. Fig. 4 Snapshot of the index_word dictionary.
C. Model Building
An attention-based CNN-RNN architecture is employed to create the framework. The proposed model's architecture is depicted in Fig. 5.
Fig. 5 The model architecture.
1) Encoder: The model's encoder is a pre-trained convolutional neural network, Google InceptionV3, which produces a representation of the input image. This model placed second in the ILSVRC2015 image classification competition [9]. Its last layer, used for classification, is removed, and the features are extracted from the pre-processed images with the InceptionV3 model. Fig. 6 shows the 42-layer structural design of the InceptionV3 model. Inception-v3 introduced a novel inception module that combines convolutional filters of several different sizes into a single block; this design reduces both the number of parameters that must be trained and the computational complexity.
The encoder accepts images of shape 299x299x3 and produces an output of shape 8x8x2048. In the original network, a fully connected layer at the end converts this into a one-dimensional array of size 1001 used for classification. The attention mechanism works by focusing on certain parts of the image while generating sequences rather than considering the full image, and a fully connected representation, being one-dimensional, does not capture the exact locations in the image needed for attention.
Fig. 6 Architecture of InceptionV3 model.
To achieve the attention mechanism, the last layer of the architecture is removed and the hidden layer with output shape 8x8x2048 is treated as the output layer. This output retains the location information of the image, making it easy to focus on particular regions while generating sequences [10]. Consider a CNN as in Fig. 7, whose input is an image and whose output is a one-dimensional vector of size 1000; this vector may not exactly capture the locations. By contrast, in Fig. 8 the output is taken from a convolution layer, which captures the locations in the original image.
Fig. 7 A convolutional layer with 1-dimensional output.
Fig. 8 A convolutional layer with 3-dimensional output capturing the locations of the image.
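A sketch of how such a feature extractor can be built in Keras; this reflects the standard include_top=False usage rather than necessarily the authors' exact code:

import tensorflow as tf

# InceptionV3 without its classification head exposes the last 8x8x2048
# convolutional feature map instead of the 1D class-score vector.
base = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
feature_extractor = tf.keras.Model(base.input, base.output)

image_batch = tf.random.uniform((1, 299, 299, 3))   # stand-in preprocessed image
features = feature_extractor(image_batch)           # shape (1, 8, 8, 2048)
# Flatten the 8x8 grid into 64 spatial locations for the attention layer.
features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))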
2) Attention-Mechanism: When creating descriptions for images, local attention concentrates on particular areas of the given image. Considering the hidden state at time t, the decoder attends to specific parts of the image and constructs a context vector from the image features produced by the CNN's convolution layer. Feeding the entire image into the RNN to produce sequences is not always feasible and may result in inappropriate captions. So, instead of the CNN's fully connected layer, the output of one of the convolutional layers, which contains spatial information about the image, is used as input to the attention model.
The encoder output of shape 8x8x2048 is reshaped to 64x2048 and given as input. The attention mechanism automatically reduces the weights of insignificant regions, eliminating the influence of non-relevant features. The attention weights are given as input to the language model, which uses them while generating sequences; as a result, these weights direct the model to generate relevant phrases that focus on the image's objects.
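One standard realisation of such a layer is Bahdanau-style additive attention; the sketch below assumes this form, since the paper does not spell out its exact scoring function:

import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects image features
        self.W2 = tf.keras.layers.Dense(units)   # projects decoder state
        self.V = tf.keras.layers.Dense(1)        # scalar score per location

    def call(self, features, hidden):
        # features: (batch, 64, 2048); hidden: (batch, units)
        hidden_t = tf.expand_dims(hidden, 1)
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_t)))
        attention_weights = tf.nn.softmax(score, axis=1)   # (batch, 64, 1)
        # Weighted sum over the 64 locations gives the context vector.
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights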
Fig. 9 depicts how the attention mechanism contributes to generating captions from a given image. The model analyses and highlights only significant aspects of the image: first the section containing the man is taken into consideration, then the section around it containing the snow mountain. The caption generated here is "A person riding skis on a mountain".
Fig. 9 Working of attention mechanism.
3) Decoder: The decoder is a language model that converts the features extracted from an image into textual descriptions. A GRU is preferred because of its structure and because it does not suffer from the vanishing-gradient problem. The attention layer produces a one-dimensional vector that is fed to the GRU. The GRU is made up of two gates: an update gate and a reset gate [11]. The update gate tells the model how much of the previous information has to be kept, i.e., passed on to the future; the reset gate determines how much of the past information should be forgotten. At each step, the weights are adjusted according to these gates. Fig. 10 depicts the architecture of the GRU, which generates sequences based on experience.
Fig. 10 Architecture of Gated Recurrent Unit (GRU).
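A sketch of a GRU decoder consistent with the description above, reusing the BahdanauAttention class sketched earlier (layer sizes are illustrative, not taken from the paper):

import tensorflow as tf

class Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True,
                                       return_state=True)
        self.fc1 = tf.keras.layers.Dense(units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)   # scores over vocabulary
        self.attention = BahdanauAttention(units)

    def call(self, x, features, hidden):
        # Attend over the 64 image locations given the previous GRU state.
        context_vector, attention_weights = self.attention(features, hidden)
        x = self.embedding(x)                          # embed the previous word
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        return self.fc2(x), state, attention_weights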
IV. TESTING AND RESULTS
TensorFlow 2.0 is used to create the model. The supplied photos are resized to 299 by 299 pixels, and the last convolutional layer's 8x8x2048 activation maps are used as annotations. The loss plot is shown in Fig. 11; the loss function is used to tune the model weights so as to reduce the loss.
Fig. 11 Loss Plot.
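The paper names categorical cross-entropy as the loss; with integer word indices this becomes the sparse variant, and padded positions are typically masked out, as in this sketch:

import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # Ignore padded positions (word id 0) so they do not contribute to the loss.
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    loss_ *= tf.cast(mask, dtype=loss_.dtype)
    return tf.reduce_mean(loss_)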
A. Model Evaluation
The BLEU metric [7] is used to evaluate the model. BLEU (Bilingual Evaluation Understudy) is a metric for comparing a generated text to one or more reference texts. It was designed for machine translation, but it can equally compare the text generated by the model with the reference, or real, captions.
The score ranges from 0 to 1 and is computed from overlapping n-grams, so both word choice and local word order influence the result. Fig. 12 shows the predicted caption and BLEU score for an image.
Fig. 12 A picture and its result from the test set.
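A score like the one in Fig. 12 can be computed with NLTK; real_captions and predicted_caption below are hypothetical variables holding the five reference strings and the model output:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [caption.split() for caption in real_captions]  # tokenized references
candidate = predicted_caption.split()                        # tokenized prediction

# Smoothing avoids zero scores when some higher-order n-grams never match.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f'BLEU: {score:.3f}')   # a value between 0 and 1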
B. Results
Some results of the model are given below; the model captions the given image based on the items in the picture and the interactions among them. Fig. 13 shows a prediction from the test set with the real caption, the predicted caption, the BLEU score, and the attention plot for the given image; the generated caption is appropriate to the content of the image.
Fig. 13 Result of a picture from the test set.
As the model is trained with only 10,000 images from the dataset, it can be expected to generate inappropriate captions occasionally, as shown in Fig. 14.
Fig. 14 Result of a picture from the test set.
Finally, consider the captions the model generates for pictures outside the train and test sets. Fig. 15 shows an appropriate caption generated for the given image, while Fig. 16 shows a predicted caption that is quite inappropriate.
Fig. 15 Appropriate caption for the given image.
Fig. 16 Inappropriate caption for the given image.
V. CONCLUSION AND FUTURE WORK
A. Conclusion
An image captioning model that integrates attention is described for constructing a natural sentence from a given image. The characteristics of an image are extracted by a convolutional neural network and given to an attention layer that generates a context vector, which a recurrent neural network then decodes into understandable sequences. The attention mechanism aids both in optimizing the results and in creating appropriate captions. The model is trained on a subset of the MS COCO dataset, and increasing the number of training samples should improve its performance; the model can also be trained on a variety of datasets. It is assessed with the BLEU metric, which considers translation length, word choice, and word order to judge how similar the machine-generated text is to the human reference. Compared to previous research, the outcomes achieved are promising and competitive.
B. Future Work
For future work, a more compact deep convolutional network may enhance the efficiency of the proposed model. The recommended model generates only one caption, but with beam search it could generate more than one caption per image, as sketched below. The proposed paradigm could also be extended to video captioning, that is, summarizing a video's information in sequences.
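For illustration only, a generic beam-search loop might look like the following, where step_fn is a hypothetical callback returning (log-probability, word-id) options for the next word given a partial caption; this is an outline, not the paper's implementation:

import heapq

def beam_search(step_fn, start_id, end_id, beam_width=3, max_len=20):
    beams = [(0.0, [start_id])]   # (cumulative log-prob, partial caption)
    completed = []
    for _ in range(max_len):
        candidates = []
        for log_p, seq in beams:
            for tok_lp, tok in step_fn(seq):
                new_seq = seq + [tok]
                if tok == end_id:             # caption finished
                    completed.append((log_p + tok_lp, new_seq))
                else:
                    candidates.append((log_p + tok_lp, new_seq))
        if not candidates:
            break
        # Keep only the beam_width most probable partial captions.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return heapq.nlargest(beam_width, completed or beams, key=lambda c: c[0])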
REFERENCES
1. Farid Melgani, Genc Hoxha, Jacopo Slaghenauffi. (2020). A New CNN-RNN Framework For Remote Sensing Image Captioning. 2020 Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS).
2. Ansar Hani, Najiba Tagougui, Monji Kherallah. (2019). Image Caption Generation Using A Deep Architecture. 2019 International Arab Conference on Information Technology (ACIT).
3. Soheyla Amirian, Khaled Rasheed, Thiab R. Taha, Hamid R. Arabnia. (2019). Image Captioning with Generative Adversarial Network. 2019 International Conference on Computational Science and Computational Intelligence (CSCI).
4. Niange Yu, Xiaolin Hu, Binheng Song, Jian Yang, Jianwei Zhang. (2019). Topic-Oriented Image Captioning Based on Order-Embedding. IEEE Transactions on Image Processing, 28(6).
5. Fang Fang, Hanli Wang, Pengjie Tang. (2018). Image Captioning with Word Level Attention. 2018 25th IEEE International Conference on Image Processing (ICIP).
6. MS COCO dataset. Available: https://cocodataset.org/#home.
7. Jason Brownlee. A Gentle Introduction to Calculating the BLEU Score for Text in Python. Deep Learning for Natural Language Processing, November 20, 2017.
8. Rishi Sidhu. Turn Words into Numbers - The NLP Way. Sep 1, 2020.
9. Sik-Ho Tsang. Inception-v3 — 1st Runner Up (Image Classification) in ILSVRC 2015. Sep 10, 2018.
10. NPTEL-NOC IITM. "Deep Learning (CS7015): Lec 15.4 Attention over Images," YouTube, Oct 24, 2018 [Video file]. Available: https://www.youtube.com/watch?v=hvhqHhrP_AU&list=PLyqSpQzTE6M9gCgajvQbc68Hk_JKGBAYT&index=119. [Accessed: Apr 20, 2021].
11. Simeon Kostadinov. Understanding GRU Networks. Dec 16, 2017.
Ad

More Related Content

Similar to ATTENTION BASED IMAGE CAPTIONING USING DEEP LEARNING (20)

An Intelligent Approach for Effective Retrieval of Content from Large Data Se...
An Intelligent Approach for Effective Retrieval of Content from Large Data Se...An Intelligent Approach for Effective Retrieval of Content from Large Data Se...
An Intelligent Approach for Effective Retrieval of Content from Large Data Se...
IJCSIS Research Publications
 
IMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IMAGE CAPTIONING USING TRANSFORMER: VISIONAIDIMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IRJET Journal
 
Image captioning using DL and NLP.pptx
Image captioning using DL and NLP.pptxImage captioning using DL and NLP.pptx
Image captioning using DL and NLP.pptx
MrUnknown820784
 
Optimal text-to-image synthesis model for generating portrait images using ge...
Optimal text-to-image synthesis model for generating portrait images using ge...Optimal text-to-image synthesis model for generating portrait images using ge...
Optimal text-to-image synthesis model for generating portrait images using ge...
nooriasukmaningtyas
 
Image super resolution using Generative Adversarial Network.
Image super resolution using Generative Adversarial Network.Image super resolution using Generative Adversarial Network.
Image super resolution using Generative Adversarial Network.
IRJET Journal
 
IRJET - Deep Learning Approach to Inpainting and Outpainting System
IRJET -  	  Deep Learning Approach to Inpainting and Outpainting SystemIRJET -  	  Deep Learning Approach to Inpainting and Outpainting System
IRJET - Deep Learning Approach to Inpainting and Outpainting System
IRJET Journal
 
Methodology for eliminating plain regions from captured images
Methodology for eliminating plain regions from captured imagesMethodology for eliminating plain regions from captured images
Methodology for eliminating plain regions from captured images
IAESIJAI
 
Overview of Video Concept Detection using (CNN) Convolutional Neural Network
Overview of Video Concept Detection using (CNN) Convolutional Neural NetworkOverview of Video Concept Detection using (CNN) Convolutional Neural Network
Overview of Video Concept Detection using (CNN) Convolutional Neural Network
IRJET Journal
 
IRJET- Visual Information Narrator using Neural Network
IRJET- Visual Information Narrator using Neural NetworkIRJET- Visual Information Narrator using Neural Network
IRJET- Visual Information Narrator using Neural Network
IRJET Journal
 
Cartoonization of images using machine Learning
Cartoonization of images using machine LearningCartoonization of images using machine Learning
Cartoonization of images using machine Learning
IRJET Journal
 
A Graph-based Web Image Annotation for Large Scale Image Retrieval
A Graph-based Web Image Annotation for Large Scale Image RetrievalA Graph-based Web Image Annotation for Large Scale Image Retrieval
A Graph-based Web Image Annotation for Large Scale Image Retrieval
IRJET Journal
 
A deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicleA deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicle
IAESIJAI
 
Paper id 25201491
Paper id 25201491Paper id 25201491
Paper id 25201491
IJRAT
 
IRJET - Content based Image Classification
IRJET -  	  Content based Image ClassificationIRJET -  	  Content based Image Classification
IRJET - Content based Image Classification
IRJET Journal
 
IRJET- Saliency based Image Co-Segmentation
IRJET- Saliency based Image Co-SegmentationIRJET- Saliency based Image Co-Segmentation
IRJET- Saliency based Image Co-Segmentation
IRJET Journal
 
IRJET-MText Extraction from Images using Convolutional Neural Network
IRJET-MText Extraction from Images using Convolutional Neural NetworkIRJET-MText Extraction from Images using Convolutional Neural Network
IRJET-MText Extraction from Images using Convolutional Neural Network
IRJET Journal
 
IRJET- Image Caption Generation System using Neural Network with Attention Me...
IRJET- Image Caption Generation System using Neural Network with Attention Me...IRJET- Image Caption Generation System using Neural Network with Attention Me...
IRJET- Image Caption Generation System using Neural Network with Attention Me...
IRJET Journal
 
Machine learning based augmented reality for improved learning application th...
Machine learning based augmented reality for improved learning application th...Machine learning based augmented reality for improved learning application th...
Machine learning based augmented reality for improved learning application th...
IJECEIAES
 
Novel Hybrid Approach to Visual Concept Detection Using Image Annotation
Novel Hybrid Approach to Visual Concept Detection Using Image AnnotationNovel Hybrid Approach to Visual Concept Detection Using Image Annotation
Novel Hybrid Approach to Visual Concept Detection Using Image Annotation
CSCJournals
 
Dc31472476
Dc31472476Dc31472476
Dc31472476
IJMER
 
An Intelligent Approach for Effective Retrieval of Content from Large Data Se...
An Intelligent Approach for Effective Retrieval of Content from Large Data Se...An Intelligent Approach for Effective Retrieval of Content from Large Data Se...
An Intelligent Approach for Effective Retrieval of Content from Large Data Se...
IJCSIS Research Publications
 
IMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IMAGE CAPTIONING USING TRANSFORMER: VISIONAIDIMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IMAGE CAPTIONING USING TRANSFORMER: VISIONAID
IRJET Journal
 
Image captioning using DL and NLP.pptx
Image captioning using DL and NLP.pptxImage captioning using DL and NLP.pptx
Image captioning using DL and NLP.pptx
MrUnknown820784
 
Optimal text-to-image synthesis model for generating portrait images using ge...
Optimal text-to-image synthesis model for generating portrait images using ge...Optimal text-to-image synthesis model for generating portrait images using ge...
Optimal text-to-image synthesis model for generating portrait images using ge...
nooriasukmaningtyas
 
Image super resolution using Generative Adversarial Network.
Image super resolution using Generative Adversarial Network.Image super resolution using Generative Adversarial Network.
Image super resolution using Generative Adversarial Network.
IRJET Journal
 
IRJET - Deep Learning Approach to Inpainting and Outpainting System
IRJET -  	  Deep Learning Approach to Inpainting and Outpainting SystemIRJET -  	  Deep Learning Approach to Inpainting and Outpainting System
IRJET - Deep Learning Approach to Inpainting and Outpainting System
IRJET Journal
 
Methodology for eliminating plain regions from captured images
Methodology for eliminating plain regions from captured imagesMethodology for eliminating plain regions from captured images
Methodology for eliminating plain regions from captured images
IAESIJAI
 
Overview of Video Concept Detection using (CNN) Convolutional Neural Network
Overview of Video Concept Detection using (CNN) Convolutional Neural NetworkOverview of Video Concept Detection using (CNN) Convolutional Neural Network
Overview of Video Concept Detection using (CNN) Convolutional Neural Network
IRJET Journal
 
IRJET- Visual Information Narrator using Neural Network
IRJET- Visual Information Narrator using Neural NetworkIRJET- Visual Information Narrator using Neural Network
IRJET- Visual Information Narrator using Neural Network
IRJET Journal
 
Cartoonization of images using machine Learning
Cartoonization of images using machine LearningCartoonization of images using machine Learning
Cartoonization of images using machine Learning
IRJET Journal
 
A Graph-based Web Image Annotation for Large Scale Image Retrieval
A Graph-based Web Image Annotation for Large Scale Image RetrievalA Graph-based Web Image Annotation for Large Scale Image Retrieval
A Graph-based Web Image Annotation for Large Scale Image Retrieval
IRJET Journal
 
A deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicleA deep learning based stereo matching model for autonomous vehicle
A deep learning based stereo matching model for autonomous vehicle
IAESIJAI
 
Paper id 25201491
Paper id 25201491Paper id 25201491
Paper id 25201491
IJRAT
 
IRJET - Content based Image Classification
IRJET -  	  Content based Image ClassificationIRJET -  	  Content based Image Classification
IRJET - Content based Image Classification
IRJET Journal
 
IRJET- Saliency based Image Co-Segmentation
IRJET- Saliency based Image Co-SegmentationIRJET- Saliency based Image Co-Segmentation
IRJET- Saliency based Image Co-Segmentation
IRJET Journal
 
IRJET-MText Extraction from Images using Convolutional Neural Network
IRJET-MText Extraction from Images using Convolutional Neural NetworkIRJET-MText Extraction from Images using Convolutional Neural Network
IRJET-MText Extraction from Images using Convolutional Neural Network
IRJET Journal
 
IRJET- Image Caption Generation System using Neural Network with Attention Me...
IRJET- Image Caption Generation System using Neural Network with Attention Me...IRJET- Image Caption Generation System using Neural Network with Attention Me...
IRJET- Image Caption Generation System using Neural Network with Attention Me...
IRJET Journal
 
Machine learning based augmented reality for improved learning application th...
Machine learning based augmented reality for improved learning application th...Machine learning based augmented reality for improved learning application th...
Machine learning based augmented reality for improved learning application th...
IJECEIAES
 
Novel Hybrid Approach to Visual Concept Detection Using Image Annotation
Novel Hybrid Approach to Visual Concept Detection Using Image AnnotationNovel Hybrid Approach to Visual Concept Detection Using Image Annotation
Novel Hybrid Approach to Visual Concept Detection Using Image Annotation
CSCJournals
 
Dc31472476
Dc31472476Dc31472476
Dc31472476
IJMER
 

More from Nathan Mathis (20)

Page Borders Design, Border Design, Baby Clip Art, Fre
Page Borders Design, Border Design, Baby Clip Art, FrePage Borders Design, Border Design, Baby Clip Art, Fre
Page Borders Design, Border Design, Baby Clip Art, Fre
Nathan Mathis
 
How To Write Your Essays In Less Minutes Using This Website Doy News
How To Write Your Essays In Less Minutes Using This Website Doy NewsHow To Write Your Essays In Less Minutes Using This Website Doy News
How To Write Your Essays In Less Minutes Using This Website Doy News
Nathan Mathis
 
Lined Paper For Beginning Writers Writing Paper Prin
Lined Paper For Beginning Writers Writing Paper PrinLined Paper For Beginning Writers Writing Paper Prin
Lined Paper For Beginning Writers Writing Paper Prin
Nathan Mathis
 
Term Paper Example Telegraph
Term Paper Example TelegraphTerm Paper Example Telegraph
Term Paper Example Telegraph
Nathan Mathis
 
Unusual How To Start Off A Compare And Contrast Essay
Unusual How To Start Off A Compare And Contrast EssayUnusual How To Start Off A Compare And Contrast Essay
Unusual How To Start Off A Compare And Contrast Essay
Nathan Mathis
 
How To Write A Methodology Essay, Essay Writer, Essa
How To Write A Methodology Essay, Essay Writer, EssaHow To Write A Methodology Essay, Essay Writer, Essa
How To Write A Methodology Essay, Essay Writer, Essa
Nathan Mathis
 
Recolectar 144 Imagem Educational Background Ex
Recolectar 144 Imagem Educational Background ExRecolectar 144 Imagem Educational Background Ex
Recolectar 144 Imagem Educational Background Ex
Nathan Mathis
 
Microsoft Word Lined Paper Template
Microsoft Word Lined Paper TemplateMicrosoft Word Lined Paper Template
Microsoft Word Lined Paper Template
Nathan Mathis
 
Owl Writing Paper
Owl Writing PaperOwl Writing Paper
Owl Writing Paper
Nathan Mathis
 
The Essay Writing Process Essays
The Essay Writing Process EssaysThe Essay Writing Process Essays
The Essay Writing Process Essays
Nathan Mathis
 
How To Make A Cover Page For Assignment Guide - As
How To Make A Cover Page For Assignment Guide - AsHow To Make A Cover Page For Assignment Guide - As
How To Make A Cover Page For Assignment Guide - As
Nathan Mathis
 
Awesome Creative Writing Essays Thatsnotus
Awesome Creative Writing Essays ThatsnotusAwesome Creative Writing Essays Thatsnotus
Awesome Creative Writing Essays Thatsnotus
Nathan Mathis
 
Sites That Write Papers For You. Websites That Write Essays For You
Sites That Write Papers For You. Websites That Write Essays For YouSites That Write Papers For You. Websites That Write Essays For You
Sites That Write Papers For You. Websites That Write Essays For You
Nathan Mathis
 
4.4 How To Organize And Arrange - Hu
4.4 How To Organize And Arrange - Hu4.4 How To Organize And Arrange - Hu
4.4 How To Organize And Arrange - Hu
Nathan Mathis
 
Essay Written In First Person
Essay Written In First PersonEssay Written In First Person
Essay Written In First Person
Nathan Mathis
 
My Purpose In Life Free Essay Example
My Purpose In Life Free Essay ExampleMy Purpose In Life Free Essay Example
My Purpose In Life Free Essay Example
Nathan Mathis
 
The Structure Of An Outline For A Research Paper, Including Text
The Structure Of An Outline For A Research Paper, Including TextThe Structure Of An Outline For A Research Paper, Including Text
The Structure Of An Outline For A Research Paper, Including Text
Nathan Mathis
 
What Are Some Topics For Exemplification Essays - Quora
What Are Some Topics For Exemplification Essays - QuoraWhat Are Some Topics For Exemplification Essays - Quora
What Are Some Topics For Exemplification Essays - Quora
Nathan Mathis
 
Please Comment, Like, Or Re-Pin For Later Bibliogra
Please Comment, Like, Or Re-Pin For Later BibliograPlease Comment, Like, Or Re-Pin For Later Bibliogra
Please Comment, Like, Or Re-Pin For Later Bibliogra
Nathan Mathis
 
Ide Populer Word In English, Top
Ide Populer Word In English, TopIde Populer Word In English, Top
Ide Populer Word In English, Top
Nathan Mathis
 
Page Borders Design, Border Design, Baby Clip Art, Fre
Page Borders Design, Border Design, Baby Clip Art, FrePage Borders Design, Border Design, Baby Clip Art, Fre
Page Borders Design, Border Design, Baby Clip Art, Fre
Nathan Mathis
 
How To Write Your Essays In Less Minutes Using This Website Doy News
How To Write Your Essays In Less Minutes Using This Website Doy NewsHow To Write Your Essays In Less Minutes Using This Website Doy News
How To Write Your Essays In Less Minutes Using This Website Doy News
Nathan Mathis
 
Lined Paper For Beginning Writers Writing Paper Prin
Lined Paper For Beginning Writers Writing Paper PrinLined Paper For Beginning Writers Writing Paper Prin
Lined Paper For Beginning Writers Writing Paper Prin
Nathan Mathis
 
Term Paper Example Telegraph
Term Paper Example TelegraphTerm Paper Example Telegraph
Term Paper Example Telegraph
Nathan Mathis
 
Unusual How To Start Off A Compare And Contrast Essay
Unusual How To Start Off A Compare And Contrast EssayUnusual How To Start Off A Compare And Contrast Essay
Unusual How To Start Off A Compare And Contrast Essay
Nathan Mathis
 
How To Write A Methodology Essay, Essay Writer, Essa
How To Write A Methodology Essay, Essay Writer, EssaHow To Write A Methodology Essay, Essay Writer, Essa
How To Write A Methodology Essay, Essay Writer, Essa
Nathan Mathis
 
Recolectar 144 Imagem Educational Background Ex
Recolectar 144 Imagem Educational Background ExRecolectar 144 Imagem Educational Background Ex
Recolectar 144 Imagem Educational Background Ex
Nathan Mathis
 
Microsoft Word Lined Paper Template
Microsoft Word Lined Paper TemplateMicrosoft Word Lined Paper Template
Microsoft Word Lined Paper Template
Nathan Mathis
 
The Essay Writing Process Essays
The Essay Writing Process EssaysThe Essay Writing Process Essays
The Essay Writing Process Essays
Nathan Mathis
 
How To Make A Cover Page For Assignment Guide - As
How To Make A Cover Page For Assignment Guide - AsHow To Make A Cover Page For Assignment Guide - As
How To Make A Cover Page For Assignment Guide - As
Nathan Mathis
 
Awesome Creative Writing Essays Thatsnotus
Awesome Creative Writing Essays ThatsnotusAwesome Creative Writing Essays Thatsnotus
Awesome Creative Writing Essays Thatsnotus
Nathan Mathis
 
Sites That Write Papers For You. Websites That Write Essays For You
Sites That Write Papers For You. Websites That Write Essays For YouSites That Write Papers For You. Websites That Write Essays For You
Sites That Write Papers For You. Websites That Write Essays For You
Nathan Mathis
 
4.4 How To Organize And Arrange - Hu
4.4 How To Organize And Arrange - Hu4.4 How To Organize And Arrange - Hu
4.4 How To Organize And Arrange - Hu
Nathan Mathis
 
Essay Written In First Person
Essay Written In First PersonEssay Written In First Person
Essay Written In First Person
Nathan Mathis
 
My Purpose In Life Free Essay Example
My Purpose In Life Free Essay ExampleMy Purpose In Life Free Essay Example
My Purpose In Life Free Essay Example
Nathan Mathis
 
The Structure Of An Outline For A Research Paper, Including Text
The Structure Of An Outline For A Research Paper, Including TextThe Structure Of An Outline For A Research Paper, Including Text
The Structure Of An Outline For A Research Paper, Including Text
Nathan Mathis
 
What Are Some Topics For Exemplification Essays - Quora
What Are Some Topics For Exemplification Essays - QuoraWhat Are Some Topics For Exemplification Essays - Quora
What Are Some Topics For Exemplification Essays - Quora
Nathan Mathis
 
Please Comment, Like, Or Re-Pin For Later Bibliogra
Please Comment, Like, Or Re-Pin For Later BibliograPlease Comment, Like, Or Re-Pin For Later Bibliogra
Please Comment, Like, Or Re-Pin For Later Bibliogra
Nathan Mathis
 
Ide Populer Word In English, Top
Ide Populer Word In English, TopIde Populer Word In English, Top
Ide Populer Word In English, Top
Nathan Mathis
 
Ad

Recently uploaded (20)

How to Manage Cross Selling in Odoo 18 Sales
How to Manage Cross Selling in Odoo 18 SalesHow to Manage Cross Selling in Odoo 18 Sales
How to Manage Cross Selling in Odoo 18 Sales
Celine George
 
How to Configure Extra Steps During Checkout in Odoo 18 Website
How to Configure Extra Steps During Checkout in Odoo 18 WebsiteHow to Configure Extra Steps During Checkout in Odoo 18 Website
How to Configure Extra Steps During Checkout in Odoo 18 Website
Celine George
 
GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 4 MARCH 2025 .pdf
GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 4 MARCH 2025 .pdfGENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 4 MARCH 2025 .pdf
GENERAL QUIZ PRELIMS | QUIZ CLUB OF PSGCAS | 4 MARCH 2025 .pdf
Quiz Club of PSG College of Arts & Science
 
MICROBIAL GENETICS -tranformation and tranduction.pdf
MICROBIAL GENETICS -tranformation and tranduction.pdfMICROBIAL GENETICS -tranformation and tranduction.pdf
MICROBIAL GENETICS -tranformation and tranduction.pdf
DHARMENDRA SAHU
 
YSPH VMOC Special Report - Measles Outbreak Southwest US 5-17-2025 .pptx
YSPH VMOC Special Report - Measles Outbreak  Southwest US 5-17-2025  .pptxYSPH VMOC Special Report - Measles Outbreak  Southwest US 5-17-2025  .pptx
YSPH VMOC Special Report - Measles Outbreak Southwest US 5-17-2025 .pptx
Yale School of Public Health - The Virtual Medical Operations Center (VMOC)
 
libbys peer assesment.docx..............
libbys peer assesment.docx..............libbys peer assesment.docx..............
libbys peer assesment.docx..............
19lburrell
 
Botany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic ExcellenceBotany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic Excellence
online college homework help
 
Conditions for Boltzmann Law – Biophysics Lecture Slide
Conditions for Boltzmann Law – Biophysics Lecture SlideConditions for Boltzmann Law – Biophysics Lecture Slide
Conditions for Boltzmann Law – Biophysics Lecture Slide
PKLI-Institute of Nursing and Allied Health Sciences Lahore , Pakistan.
 
2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx
mansk2
 
IPL QUIZ | THE QUIZ CLUB OF PSGCAS | 2025.pdf
IPL QUIZ | THE QUIZ CLUB OF PSGCAS | 2025.pdfIPL QUIZ | THE QUIZ CLUB OF PSGCAS | 2025.pdf
IPL QUIZ | THE QUIZ CLUB OF PSGCAS | 2025.pdf
Quiz Club of PSG College of Arts & Science
 
ITI COPA Question Paper PDF 2017 Theory MCQ
ITI COPA Question Paper PDF 2017 Theory MCQITI COPA Question Paper PDF 2017 Theory MCQ
ITI COPA Question Paper PDF 2017 Theory MCQ
SONU HEETSON
 
Module 1: Foundations of Research
Module 1: Foundations of ResearchModule 1: Foundations of Research
Module 1: Foundations of Research
drroxannekemp
 
Module_2_Types_and_Approaches_of_Research (2).pptx
Module_2_Types_and_Approaches_of_Research (2).pptxModule_2_Types_and_Approaches_of_Research (2).pptx
Module_2_Types_and_Approaches_of_Research (2).pptx
drroxannekemp
 
PUBH1000 Slides - Module 11: Governance for Health
PUBH1000 Slides - Module 11: Governance for HealthPUBH1000 Slides - Module 11: Governance for Health
PUBH1000 Slides - Module 11: Governance for Health
JonathanHallett4
 
Cyber security COPA ITI MCQ Top Questions
Cyber security COPA ITI MCQ Top QuestionsCyber security COPA ITI MCQ Top Questions
Cyber security COPA ITI MCQ Top Questions
SONU HEETSON
 
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit..."Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
AlionaBujoreanu
 
114P_English.pdf114P_English.pdf114P_English.pdf
114P_English.pdf114P_English.pdf114P_English.pdf114P_English.pdf114P_English.pdf114P_English.pdf
114P_English.pdf114P_English.pdf114P_English.pdf
paulinelee52
 
Peer Assesment- Libby.docx..............
Peer Assesment- Libby.docx..............Peer Assesment- Libby.docx..............
Peer Assesment- Libby.docx..............
19lburrell
 
materi 3D Augmented Reality dengan assemblr
materi 3D Augmented Reality dengan assemblrmateri 3D Augmented Reality dengan assemblr
materi 3D Augmented Reality dengan assemblr
fatikhatunnajikhah1
 
Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...
Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...
Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...
parmarjuli1412
 
How to Manage Cross Selling in Odoo 18 Sales
How to Manage Cross Selling in Odoo 18 SalesHow to Manage Cross Selling in Odoo 18 Sales
How to Manage Cross Selling in Odoo 18 Sales
Celine George
 
How to Configure Extra Steps During Checkout in Odoo 18 Website
How to Configure Extra Steps During Checkout in Odoo 18 WebsiteHow to Configure Extra Steps During Checkout in Odoo 18 Website
How to Configure Extra Steps During Checkout in Odoo 18 Website
Celine George
 
MICROBIAL GENETICS -tranformation and tranduction.pdf
MICROBIAL GENETICS -tranformation and tranduction.pdfMICROBIAL GENETICS -tranformation and tranduction.pdf
MICROBIAL GENETICS -tranformation and tranduction.pdf
DHARMENDRA SAHU
 
libbys peer assesment.docx..............
libbys peer assesment.docx..............libbys peer assesment.docx..............
libbys peer assesment.docx..............
19lburrell
 
Botany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic ExcellenceBotany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic Excellence
online college homework help
 
2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx
mansk2
 
ITI COPA Question Paper PDF 2017 Theory MCQ
ITI COPA Question Paper PDF 2017 Theory MCQITI COPA Question Paper PDF 2017 Theory MCQ
ITI COPA Question Paper PDF 2017 Theory MCQ
SONU HEETSON
 
Module 1: Foundations of Research
Module 1: Foundations of ResearchModule 1: Foundations of Research
Module 1: Foundations of Research
drroxannekemp
 
Module_2_Types_and_Approaches_of_Research (2).pptx
Module_2_Types_and_Approaches_of_Research (2).pptxModule_2_Types_and_Approaches_of_Research (2).pptx
Module_2_Types_and_Approaches_of_Research (2).pptx
drroxannekemp
 
PUBH1000 Slides - Module 11: Governance for Health
PUBH1000 Slides - Module 11: Governance for HealthPUBH1000 Slides - Module 11: Governance for Health
PUBH1000 Slides - Module 11: Governance for Health
JonathanHallett4
 
Cyber security COPA ITI MCQ Top Questions
Cyber security COPA ITI MCQ Top QuestionsCyber security COPA ITI MCQ Top Questions
Cyber security COPA ITI MCQ Top Questions
SONU HEETSON
 
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit..."Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
"Bridging Cultures Through Holiday Cards: 39 Students Celebrate Global Tradit...
AlionaBujoreanu
 
114P_English.pdf114P_English.pdf114P_English.pdf
114P_English.pdf114P_English.pdf114P_English.pdf114P_English.pdf114P_English.pdf114P_English.pdf
114P_English.pdf114P_English.pdf114P_English.pdf
paulinelee52
 
Peer Assesment- Libby.docx..............
Peer Assesment- Libby.docx..............Peer Assesment- Libby.docx..............
Peer Assesment- Libby.docx..............
19lburrell
 
materi 3D Augmented Reality dengan assemblr
materi 3D Augmented Reality dengan assemblrmateri 3D Augmented Reality dengan assemblr
materi 3D Augmented Reality dengan assemblr
fatikhatunnajikhah1
 
Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...
Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...
Mental Health Assessment in 5th semester bsc. nursing and also used in 2nd ye...
parmarjuli1412
 
Ad

ATTENTION BASED IMAGE CAPTIONING USING DEEP LEARNING

  • 1. International Journal of Innovative Research in Advanced Engineering (IJIRAE) ISSN: 2349-2163 Issue 12, Volume 8 (December 2021) https:/ / www.ijirae.com/ archives ____________________________________________________________________________________________________________________________________ IJIRAE:: © 2014-22, AM Publications, India - All Rights Reserved Page -379 ATTENTION BASED IMAGE CAPTIONINGUSINGDEEP LEARNING Tejaswini Nakirekanti* Dept of CSE, Mahatma Gandhi Institute of Technology, Hyderabad-500075, Telangana, India tejaswininakirekanti@gmail.com Deepika D Assistant Professor, Dept of CSE, Mahatma Gandhi Institute of Technology, Hyderabad-500075, Telangana, India ddeepika_cse@mgit.ac.in Publication History Manuscript Reference No: IJIRAE/ RS/ Vol.08/ Issue12/ DCAE10087 Research Article | Open Access Peer-review: Double-blind Peer-reviewed Article ID: IJIRAE/ RS/ Vol.08/ Issue12/ DCAE10087 Received: 20, December 2021 Accepted: 29, December 2021 Published Online: 30, December 2021 Volume 2021 | Article ID DCAE10087 http:/ / www.ijirae.com/ volumes/ Vol8/ iss-12/ 09.DCAE10087.pdf Citation: Tejaswini,Deepika (2021). Attention Based image Captioning using Deep Learning. International Journal of Innovative Research in Advanced Engineering, VIII, 379-387 doi: https:/ / doi.org/ 10.26562/ ijirae.2021.v0812.009 Editor-Chief: Dr.A.Arul Lawrence Selvakumar, Chief Editor, IJIRAE, AM Publications, India Copyright: © 2021 This is an open access article distributed under the terms of the Creative Commons Attribution License; Which Permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited Abstract: The process of deriving written descriptions from a picture based on the items and behaviours portrayed is known as image caption generation. The resulting description will represent what is in the image and the interactions among the objects. However, like with any other image processing challenge, recreating this characteristic in an artificial system is a difficult feat, and so the adoption Deep Learning procedures can help to tackle the problem. The main goal of this project is to generate an image captioning model that can provide a concise and relatable description of the picture by incorporating an attention mechanism. Ideally, the model can be separated into two components – one is called Encoder and the other is a Decoder. The encoder used here is a Google InceptionV3 pre-trained model. And the decoder is a language-based model, Gated Recurrent Unit (GRU) to translate the characteristics and objects provided by the encoder into natural sentences. Furthermore, the model's effectiveness is aided by the attention mechanism. To conclude, the model is trained on a subset of the MS COCO dataset, which contains about 80,000 pictures with at least five descriptions each and evaluated based on the BLEU score. To reduce the loss, the model weights are tuned using the categorical cross-entropy loss function. The results of the model are promising and competitive. Keywords: CNN-RNN Architecture, Inception V3, Attention-mechanism, Gated recurrent unit (GRU), BLEU-metric. I. INTRODUCTION A computer can't tell what is there in a picture, unlike humans. We human beings can able to differentiate actions in the world based on our knowledge. For a computer to analyse and differentiate the objects and actions in an image is hard but not impossible. 
Deep learning advancements have led to the development of a model which can generate captions for an image with the help of a large collection of data and computation power. This helps to give some meaning to huge data, which also creates an impact in future research. Many applications can benefit from a model that provides descriptions for images automatically. Enhancing the accuracy of search engines, self-driving automobiles, identification, and vision applications are just a few examples. Image captioning began to rely on deep learning for its recent advances. Usually, the encoder-decoder framework is employed, where CNN is used as an encoder to extract information from images. RNN is employed as a decoder to convert this representation into a natural language description. A CNN, on the other hand, must condense all of the input data and then sent to the decoder. It is possible that compressing input sequences will result in data loss. The CNN - RNN image captioning architecture includes an attention to focus on one part of the image while generating a single word. The visual attention aids in reducing the impact of irrelevant words while generating a single word at a point of time that is appropriate to the current context.
  • 2. International Journal of Innovative Research in Advanced Engineering (IJIRAE) ISSN: 2349-2163 Issue 12, Volume 8 (December 2021) https:/ / www.ijirae.com/ archives ______________________________________________________________________________________________________________________________________ IJIRAE:: © 2014-20, AM Publications, India - All Rights Reserved Page -380 A. Problem Definition Human participation is usually required to annotate images, making it a near-impossible process for large commercial databases. Deep learning can aid in the dynamic annotation of these photos. Image captioning aims to create a logical and holistic explanation of an image's content. The project's purpose is to create an image captioning model that uses attention to construct a description from an input image. B. Existing System From retrieval-based and template-based methods, image captioning has grown rapidly in recent years. Retrieval based methods tries to extract similar-captioned images to the query image by leveraging distance metric and combine all the retrieved captions together to generate new caption. In the future, this approach may not be able to produce new descriptions. In case of template based methods, it maintains a collection of sentence templates which work according to the scenario in the image and tries to fill the template by depicting particular objects or actions present in the target image. The main drawback of this technique is it cannot able to generate variable length captions even though they are grammatically correct. C. Proposed System The proposed model employs a CNN to encode a picture which contains the nuances of the image. This is subsequently utilized to build a context vector by the attention layer. Later the RNN decodes the information produced by the attention model into a sequential, comprehensible statement. II. LITERATURE SURVEY Potential research work carried out on various techniques for the image caption generation using different techniques and has been discussed in the following paragraphs. Various researchers have employed different mechanisms for captioning the image and to find out the most useful features. (Genc Hoxha et. al.) Proposed an image captioning framework that combines generation and retrieval-based image captioning in this research. Multiple descriptions are generated by combining CNN-RNN architecture with beam- searching. Instead of generating a single visual description, beam search assists in the generation of many descriptions. The experiments are performed on RSCID dataset. The use of more advanced lexical similarity can improve the lexical similarity between the generated and reference descriptions. [1] (Ansar Hani et. al.) Put forth an encoder-decoder structure for an attention-based picture captioning model. The attention mechanism helps you focus on certain parts of a visual rather than the complete picture. The encoder is a pre-trained Google InceptionV3 model, and the decoder is a Gated Recurrent Unit (GRU). The encoder uses CNN to extract the image's features, while the decoder derives relevant descriptions. The model was tested by MSCOCO dataset, and the outcomes were provided. The model's is assessed using Adam's adaptive gradient optimizer. [2] (Soheyla Amirian et. al.) Created a GAN network, which comprises a Descriptor and a Generator, in this study. The Variational Autoencoder (VAE) is employed which helps in mapping the input to a distribution. 
Auto-encoders are neural networks that learn data encodings in an unsupervised manner. The model added sentiment and diversity to the captions to generate more human-like descriptions rather than predefined templates. The model is evaluated on the MSCOCO dataset. [3]
(Niange Yu et al.) introduced an image captioning architecture that constructs descriptions based on a given topic. A multi-label classifier selects the topics in a given image. The model takes an image-topic pair as input and yields a natural sentence for the image as output. [4]
(Fang Fang et al.) developed a unique word-level attention layer that processes visual attributes using two modules for reliable word prediction. The first module is a bidirectional spatial-embedding module for handling region proposals, and the second extracts word-level attention, which is subsequently fed into the language model via an attention approach. A recurrent neural network termed Long Short-Term Memory (LSTM) is used as the language model, and the nuances of a picture are extracted using a convolutional neural network (CNN). [5]
III. DESIGN METHODOLOGY
The purpose of image caption extraction is to provide a detailed and comprehensive description of the image's content. When extracting the sentence, the model evaluates not just the objects in the image but also their relationships in order to create sequences. An attention-oriented CNN-RNN framework is used to accomplish this. The following is a general outline of the work involved in creating captions for images:
• Model Architecture
• Data Collection and Preprocessing
• Model Building
A. Model Architecture
Fig. 1 depicts the complete system framework. The dataset used for the system is the MSCOCO dataset, which contains images along with five descriptions per picture [6]. Pre-processing is then performed to generate a vocabulary set from the dataset's captions and to resize the images. The encoder-decoder framework, together with the attention mechanism, makes up the model building. The goal of the attention model is to boost the model's efficiency.
The Google InceptionV3 model is used as an encoder to extract the nuances from an image, and an RNN Gated Recurrent Unit (GRU) is used to generate meaningful sequences. The model's efficiency is evaluated using the BLEU metric.
Fig. 1 Flowchart of model architecture.
B. Data Collection and Preprocessing
Although there are various resources, the MSCOCO dataset, which is very popular and known for its large-scale object detection tasks, is chosen to train the model. This dataset comprises a collection of photos of daily activities along with their captions. It contains over 80,000 photos, each of which is accompanied by at least five user-generated phrases describing the scenario depicted in the photograph. Fig. 2 shows a picture from the dataset together with its five reference descriptions.
Fig. 2 A snapshot of an image from the dataset.
1) Data Preparation and Preprocessing: The dataset has to be converted into a layout suitable for training the model. As the dataset contains different kinds of images and descriptions, preparation and pre-processing are needed before the data is fed into the model.
2) Preprocessing Images: The dataset contains images of different dimensions; the images have to be scaled so that they are suitable for the model's encoder. The encoder of this project accepts images of shape 299x299x3, so each image must be scaled to this shape before being fed to the model.
3) Preprocessing Descriptions: The descriptions in the dataset are English sentences. From these sentences, a vocabulary has to be created based on word frequency. This is done with pre-processing techniques in which the sentences are split into words and punctuation is eliminated; the top 5,000 most frequent words are then extracted and form the vocabulary. Machine learning models only accept vectors (arrays of numbers) as input, so the textual input, which is nothing but strings, has to be transformed into numbers (the text must be "vectorized") before being fed to the model. To do this, word_index and index_word dictionaries are maintained, as shown in Fig. 3 and Fig. 4, where each word is associated with a unique number and each number with a unique word [8]. These dictionaries are used to convert sentences into sequences of numbers and back, and are used while training, testing, evaluating, and predicting.
4) Divide data into train and test sets: The purpose of dividing the data is straightforward: to prevent the model from overfitting and to evaluate it appropriately, the data must be separated into train, validation, and test splits. A sketch of the full preprocessing pipeline follows Fig. 4.
Fig. 3 Snapshot of the word_index dictionary.
Fig. 4 Snapshot of the index_word dictionary.
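The following is a minimal sketch of the preprocessing steps described above, written with TensorFlow/Keras. The parallel lists captions and image_paths, the split ratio, and the variable names are illustrative assumptions, not the exact code of this project.

    import tensorflow as tf
    from sklearn.model_selection import train_test_split

    # --- Text preprocessing: build a 5,000-word vocabulary from the captions ---
    # `captions` is assumed to be a list of strings such as
    # "<start> a man riding a horse <end>".
    tokenizer = tf.keras.preprocessing.text.Tokenizer(
        num_words=5000,                                # keep the 5,000 most frequent words
        oov_token="<unk>",                             # placeholder for out-of-vocabulary words
        filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')     # strip punctuation (keeps < and >)
    tokenizer.fit_on_texts(captions)
    word_index = tokenizer.word_index                  # word    -> unique integer
    index_word = tokenizer.index_word                  # integer -> word

    # Vectorize the captions and pad them to a common length for batching.
    sequences = tokenizer.texts_to_sequences(captions)
    cap_vector = tf.keras.preprocessing.sequence.pad_sequences(sequences, padding="post")

    # --- Image preprocessing: resize to the 299x299x3 shape InceptionV3 expects ---
    def load_image(path):
        img = tf.io.read_file(path)
        img = tf.image.decode_jpeg(img, channels=3)
        img = tf.image.resize(img, (299, 299))
        img = tf.keras.applications.inception_v3.preprocess_input(img)  # pixels to [-1, 1]
        return img

    # --- Split the image/caption pairs into train and test sets ---
    img_train, img_test, cap_train, cap_test = train_test_split(
        image_paths, cap_vector, test_size=0.2, random_state=0)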
C. Model Building
An attention-based CNN-RNN architecture is employed to create the framework. The proposed model's architecture is depicted in Fig. 5.
Fig. 5 The model architecture.
1) Encoder: The model's encoder is a pre-trained convolutional neural network, Google InceptionV3, which produces a representation of the input image. This model placed second in the ILSVRC2015 image classification competition [9]. The last layer, which is used for classification, was removed, and the characteristics are extracted from the pre-processed images using the InceptionV3 model. Fig. 6 shows the 42-layer structural design of the InceptionV3 model. Inception-v3 introduced a novel inception module that combines several convolutional filters of different sizes into a single filter. This design reduces the number of parameters that must be trained, and with it the computational complexity. The encoder accepts images of shape 299x299x3 and gives an output of shape 8x8x2048; in the original network, a fully connected layer at the end converts this into a one-dimensional array of size 1001 used for classification. The attention mechanism works by focusing on certain parts of the image while generating sequences, rather than considering the full image, and the fully connected representation cannot support this because, being one-dimensional, it does not capture the exact locations within the image. A minimal sketch of the resulting feature extractor is given below.
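As a rough illustration of the encoder just described, the sketch below builds InceptionV3 without its classification head so that the spatial 8x8x2048 activation map is exposed; the variable names are illustrative.

    import tensorflow as tf

    # Load InceptionV3 pre-trained on ImageNet and drop the classification layers.
    # include_top=False keeps the last convolutional output instead of the
    # one-dimensional fully connected output used for classification.
    base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")

    # Wrap it as a feature extractor: input 299x299x3 -> output 8x8x2048.
    encoder = tf.keras.Model(inputs=base.input, outputs=base.output)

    features = encoder(tf.zeros((1, 299, 299, 3)))   # sanity check on the shape
    print(features.shape)                            # (1, 8, 8, 2048)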
Fig. 6 Architecture of the InceptionV3 model.
To achieve the attention mechanism, the last layer of the architecture is removed and the hidden layer with an output shape of 8x8x2048 is taken as the output layer. This output retains the location information of the image, making it easy to focus on certain regions while generating sequences [10]. Consider a CNN network, as shown in Fig. 7, where the input is an image and the output is a one-dimensional vector of size 1000; such a vector does not capture locations. In contrast, in Fig. 8 the output is taken from a convolutional layer, which captures the locations in the original image.
Fig. 7 A convolutional layer with 1-dimensional output.
Fig. 8 A convolutional layer with 3-dimensional output capturing the locations of the image.
2) Attention Mechanism: When creating descriptions for images, local attention concentrates on particular areas of the given image. The decoder pays attention to specific parts of the image, conditioned on the hidden state at time t, and constructs a context vector from the nuances of the image provided by the CNN's convolutional layer. Giving the entire image as input to the RNN at every step is not always practical and may result in inappropriate captions. Therefore, instead of the CNN's fully connected layer, the output of one of the convolutional layers, which contains spatial information about the image, is used as input to the attention model: the encoder output of shape 8x8x2048 is reshaped to 64x2048 and given as input. The attention mechanism automatically reduces the weights of insignificant features to eliminate the influence of non-relevant terms. These attention weights are given as input to the language model, which helps in generating sequences; as a result, the weights direct the model to generate relevant phrases that focus on the image's objects. Fig. 9 depicts how the attention mechanism contributes to the generation of captions from a given image: the model analyses and highlights only the significant aspects of the image. First the region containing the man is taken into consideration, then the region around it containing the snow mountain; the caption generated here would be "A person riding skis on a mountain". A sketch of such an attention layer follows the figure.
Fig. 9 Working of the attention mechanism.
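The following is a minimal sketch of a Bahdanau-style (additive) attention layer of the kind described above, assuming the 64x2048 reshaped encoder features and a GRU hidden state; the layer sizes are illustrative assumptions.

    import tensorflow as tf

    class BahdanauAttention(tf.keras.Model):
        """Additive attention: scores each of the 64 image regions against the
        decoder's current hidden state and returns a weighted context vector."""
        def __init__(self, units):
            super().__init__()
            self.W1 = tf.keras.layers.Dense(units)   # projects the image features
            self.W2 = tf.keras.layers.Dense(units)   # projects the decoder hidden state
            self.V = tf.keras.layers.Dense(1)        # collapses to one score per region

        def call(self, features, hidden):
            # features: (batch, 64, 2048) reshaped encoder output
            # hidden:   (batch, units) decoder state at time t
            hidden_with_time = tf.expand_dims(hidden, 1)     # (batch, 1, units)
            scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
            attention_weights = tf.nn.softmax(scores, axis=1)          # (batch, 64, 1)
            context_vector = tf.reduce_sum(attention_weights * features, axis=1)
            return context_vector, attention_weights  # (batch, 2048), weights for plots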
3) Decoder: The decoder is a language model that converts the nuances extracted from an image into textual descriptions. A GRU is employed because of its structure and because it does not suffer from the vanishing gradient problem. The attention layer produces a one-dimensional context vector, which is fed to the GRU. The GRU is made up of two gates: an update gate and a reset gate [11]. The purpose of the update gate is to tell the model how much of the previous information has to be kept, i.e., passed on to the future, while the reset gate determines how much of the past information should be forgotten. At each step, the weights are adjusted according to these gates. Fig. 10 depicts the architecture of the GRU, which is used to generate sequences based on experience; a minimal sketch of such a decoder follows the figure.
Fig. 10 Architecture of the Gated Recurrent Unit (GRU).
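Below is a minimal sketch of a GRU decoder that consumes the attention context vector together with the previous word's embedding, reusing the BahdanauAttention layer sketched earlier; the embedding and hidden sizes are illustrative assumptions.

    import tensorflow as tf

    class Decoder(tf.keras.Model):
        """One decoding step: embed the previous word, attend over the image
        features, and let the GRU predict a distribution over the vocabulary."""
        def __init__(self, vocab_size, embedding_dim=256, units=512):
            super().__init__()
            self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
            self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
            self.fc1 = tf.keras.layers.Dense(units)
            self.fc2 = tf.keras.layers.Dense(vocab_size)  # logits over the 5,000-word vocabulary
            self.attention = BahdanauAttention(units)     # the layer sketched above

        def call(self, word_id, features, hidden):
            # word_id: (batch, 1) previous word; features: (batch, 64, 2048)
            context_vector, attention_weights = self.attention(features, hidden)
            x = self.embedding(word_id)                               # (batch, 1, embedding_dim)
            x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
            output, state = self.gru(x)                   # update/reset gates act inside the GRU
            x = self.fc2(self.fc1(tf.reshape(output, (-1, output.shape[2]))))
            return x, state, attention_weights            # logits, new hidden state, weights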
IV. TESTING AND RESULTS
TensorFlow 2.0 is used to create the model. The supplied photos are resized to 299 by 299 pixels, and the last convolutional layer's 8x8x2048 activation maps are used as annotations. The categorical cross-entropy loss function is used to tune the model weights and reduce the loss; the loss plot is shown in Fig. 11.
Fig. 11 Loss plot.
A. Model Evaluation
The BLEU metric [7] is used to evaluate the model. The Bilingual Evaluation Understudy, or BLEU, is a metric for comparing a translated text to one or more reference translations. Although it was designed for machine translation, it can also be used to compare the text generated by the model to the reference (real) captions. The score lies on a scale of 0 to 1; when only single-word (unigram) matches are counted, the comparison does not depend on word order. Fig. 12 shows a predicted caption and its BLEU score for an image from the test set; an illustrative example of computing the score follows the figure.
Fig. 12 A picture and its result from the test set.
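As an illustration, BLEU can be computed with NLTK's sentence_bleu; the captions below are made-up examples, not outputs of this model.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Reference captions (tokenized) for one test image -- illustrative only.
    references = [
        "a person riding skis on a snowy mountain".split(),
        "a man skiing down a snow covered slope".split(),
    ]
    candidate = "a person riding skis on a mountain".split()

    # Unigram BLEU with smoothing, on the 0-to-1 scale discussed above.
    score = sentence_bleu(references, candidate,
                          weights=(1.0, 0, 0, 0),
                          smoothing_function=SmoothingFunction().method1)
    print(round(score, 3))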
B. Results
Some results of the model are given below; the model captions the given image based on the items in the picture and the interactions among them. Fig. 13 shows a prediction from the test set, with the real caption, predicted caption, BLEU score, and attention plot for the given image; the generated caption is appropriate to the content of the image.
Fig. 13 Result of a picture from the test set.
As the model is trained on only 10,000 images from the dataset, it may generate inappropriate captions, as shown in Fig. 14.
Fig. 14 Result of a picture from the test set.
The model was also run on pictures outside the train and test sets. Fig. 15 shows an appropriate caption generated for a given image, while Fig. 16 shows a predicted caption that is quite inappropriate.
Fig. 15 Appropriate caption for the given image.
Fig. 16 Inappropriate caption for the given image.
V. CONCLUSION AND FUTURE WORK
A. Conclusion
An image captioning model that integrates attention is described to construct a natural sentence from a given image. The characteristics of an image are extracted by a convolutional neural network and then given to the attention layer to generate a context vector; this is decoded by a recurrent neural network into understandable sequences. The attention mechanism aids in the optimization of results as well as the creation of appropriate captions. The model is trained on a subset of the MS COCO dataset, and the number of training samples must be increased to improve its performance; the model can also be trained on a variety of datasets. The model is assessed using the BLEU metric, which considers translation length, word choice, and word order to judge how similar the generated caption is to the human reference. When compared to previous research, the outcomes achieved are promising and competitive.
B. Future Work
For future work, implementing a more compact deep convolutional network may help to enhance the efficiency of the proposed model. The current model generates only one caption, but this can be extended with beam search, where the model generates more than one candidate caption per image; a sketch of the beam-search idea is given below. The proposed model could also be updated to include video captioning, in which a video's information is summarized in sequences.
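As a rough sketch of the beam-search idea mentioned above (future work, not part of the implemented model): at each step the k best partial captions are kept and extended. The decoder_step helper, which is assumed to return log-probabilities over the vocabulary for the next word, is hypothetical.

    def beam_search(decoder_step, start_id, end_id, k=3, max_len=20):
        """Keep the k highest-scoring partial captions at each step.
        decoder_step(sequence) is a hypothetical helper returning a list of
        log-probabilities over the vocabulary for the next word."""
        beams = [([start_id], 0.0)]                  # (token ids, cumulative log-prob)
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq[-1] == end_id:                # finished captions pass through unchanged
                    candidates.append((seq, score))
                    continue
                log_probs = decoder_step(seq)
                for word_id, lp in enumerate(log_probs):
                    candidates.append((seq + [word_id], score + lp))
            # Keep only the k best partial captions.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        return beams                                 # k candidate captions with scores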
REFERENCES
1. Genc Hoxha, Farid Melgani, Jacopo Slaghenauffi. (2020). A New CNN-RNN Framework for Remote Sensing Image Captioning. 2020 Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS).
2. Ansar Hani, Najiba Tagougui, Monji Kherallah. (2019). Image Caption Generation Using a Deep Architecture. 2019 International Arab Conference on Information Technology (ACIT).
3. Soheyla Amirian, Khaled Rasheed, Thiab R. Taha, Hamid R. Arabnia. (2019). Image Captioning with Generative Adversarial Network. 2019 International Conference on Computational Science and Computational Intelligence (CSCI).
4. Niange Yu, Xiaolin Hu, Binheng Song, Jian Yang, Jianwei Zhang. (2018). Topic-Oriented Image Captioning Based on Order-Embedding. IEEE Transactions on Image Processing, Volume 28, Issue 6, June 2019.
5. Fang Fang, Hanli Wang, Pengjie Tang. (2018). Image Captioning with Word Level Attention. 2018 25th IEEE International Conference on Image Processing (ICIP).
6. MS COCO dataset. Available: https://cocodataset.org/#home.
7. Jason Brownlee. A Gentle Introduction to Calculating the BLEU Score for Text in Python. Deep Learning for Natural Language Processing, November 20, 2017.
8. Rishi Sidhu. Turn Words into Numbers - The NLP Way. September 1, 2020.
9. Sik-Ho Tsang. Inception-v3 - 1st Runner Up (Image Classification) in ILSVRC 2015. September 10, 2018.
10. NPTEL-NOC IITM. "Deep Learning (CS7015): Lec 15.4 Attention over Images," YouTube, October 24, 2018 [Video file]. Available: https://www.youtube.com/watch?v=hvhqHhrP_AU&list=PLyqSpQzTE6M9gCgajvQbc68Hk_JKGBAYT&index=119. [Accessed: April 20, 2021].
11. Simeon Kostadinov. Understanding GRU Networks. December 16, 2017.