GLOVE-ING ATTENTION: A MULTI-MODAL NEURAL LEARNING APPROACH TO IMAGE CAPTIONING

Cited by: 0
|
Authors
Anundskas, Lars Halvor [1 ]
Afridi, Hina [1 ,3 ]
Tarekegn, Adane Nega [1 ]
Yamin, Muhammad Mudassar [2 ]
Ullah, Mohib [1 ]
Yamin, Saira [2 ]
Cheikh, Faouzi Alaya [1 ]
Affiliations
[1] Norwegian Univ Sci & Technol NTNU, Dept Comp Sci, Trondheim, Norway
[2] Dept Management Sci, CUI Wah Campus, Wah Cantt, Pakistan
[3] Geno SA, Storhamargata 44, N-2317 Hamar, Norway
Keywords
Chest X-ray; convolutional neural networks; attention; GloVe embeddings; gated recurrent units
DOI
10.1109/ICASSPW59220.2023.10193011
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Describing images in natural language is a challenging task in computer vision. Image captioning is typically addressed with learning frameworks that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Conventional RNNs, however, suffer from exploding and vanishing gradients, which often leads to flat, non-evocative captions. In this paper, we propose an encoder-decoder deep neural network for image captioning that uses the state-of-the-art EfficientNet backbone as the encoder. The decoder is a multimodal gated recurrent unit (GRU) that combines GloVe word embeddings for the text with visual attention over the image features. The network is trained on three datasets, Indiana Chest X-ray, COCO and WIT, and evaluated with the standard BLEU and METEOR metrics. The quantitative results show that the network achieves promising performance compared to state-of-the-art models. The source code is publicly available at https://bitbucket.org/larswise/imagecaptioning/src/master/wit_pipeline/.
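To make the described encoder-decoder concrete, the sketch below shows one plausible PyTorch realization: an EfficientNet backbone whose feature map is flattened into a grid of region features, and a GRU decoder with GloVe-initialized embeddings and additive (Bahdanau-style) visual attention. This is a minimal illustration under stated assumptions, not the authors' implementation; the class and variable names (Encoder, AttentionGRUDecoder, glove_matrix, the choice of efficientnet_b0, hidden sizes) are hypothetical and not taken from the paper or its repository.

```python
# Illustrative sketch only (not the paper's code): EfficientNet encoder ->
# GRU decoder with GloVe embeddings and additive visual attention.
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """EfficientNet backbone; returns a grid of region features."""
    def __init__(self):
        super().__init__()
        backbone = models.efficientnet_b0(weights=None)  # pretrained weights optional
        self.features = backbone.features                # drop the classifier head
        self.feat_dim = 1280                             # channels of the b0 feature map

    def forward(self, images):                           # images: (B, 3, H, W)
        fmap = self.features(images)                      # (B, C, h, w)
        B, C, h, w = fmap.shape
        return fmap.view(B, C, h * w).permute(0, 2, 1)    # (B, h*w, C) region features


class AttentionGRUDecoder(nn.Module):
    """GRU decoder with additive visual attention and GloVe word embeddings."""
    def __init__(self, vocab_size, glove_matrix, feat_dim=1280, hidden=512):
        super().__init__()
        emb = glove_matrix.size(1)                        # GloVe dimension, e.g. 300
        self.embed = nn.Embedding.from_pretrained(glove_matrix, freeze=False)
        self.attn_feat = nn.Linear(feat_dim, hidden)
        self.attn_hid = nn.Linear(hidden, hidden)
        self.attn_score = nn.Linear(hidden, 1)
        self.gru = nn.GRUCell(emb + feat_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, captions):                   # feats: (B, N, feat_dim)
        B = feats.size(0)
        h = feats.new_zeros(B, self.gru.hidden_size)
        logits = []
        for t in range(captions.size(1)):                 # teacher forcing over tokens
            # attention weights over the N image regions, conditioned on the GRU state
            score = self.attn_score(torch.tanh(
                self.attn_feat(feats) + self.attn_hid(h).unsqueeze(1)))  # (B, N, 1)
            alpha = torch.softmax(score, dim=1)
            context = (alpha * feats).sum(dim=1)           # (B, feat_dim)
            x = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            h = self.gru(x, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                  # (B, T, vocab_size)


if __name__ == "__main__":
    vocab = 1000
    glove = torch.randn(vocab, 300)                        # stand-in for a real GloVe matrix
    enc, dec = Encoder(), AttentionGRUDecoder(vocab, glove)
    feats = enc(torch.randn(2, 3, 224, 224))
    scores = dec(feats, torch.randint(0, vocab, (2, 12)))
    print(scores.shape)                                    # torch.Size([2, 12, 1000])
```

Flattening the backbone's spatial feature map into region vectors is what lets the decoder re-weight image regions at every word step; the attention weights alpha are recomputed from the current GRU state, which is the usual way visual attention is coupled to a recurrent decoder.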
Pages: 5
Related Papers
50 records in total
  • [31] A Multi-Modal Deep Learning Approach for Emotion Recognition
    Shahzad, H. M.
    Bhatti, Sohail Masood
    Jaffar, Arfan
    Rashid, Muhammad
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2023, 36 (02): : 1561 - 1570
  • [32] HATF: Multi-Modal Feature Learning for Infrared and Visible Image Fusion via Hybrid Attention Transformer
    Liu, Xiangzeng
    Wang, Ziyao
    Gao, Haojie
    Li, Xiang
    Wang, Lei
    Miao, Qiguang
    REMOTE SENSING, 2024, 16 (05)
  • [33] Single-shot hyperspectral imaging based on dual attention neural network with multi-modal learning
    He, Tianyue
    Zhang, Qican
    Zhou, Mingwei
    Kou, Tingdong
    Shen, Junfei
    OPTICS EXPRESS, 2022, 30 (06) : 9790 - 9813
  • [34] Learning Cross-modal Representations with Multi-relations for Image Captioning
    Cheng, Peng
    Le, Tung
    Racharak, Teeradaj
    Cao Yiming
    Kong Weikun
    Minh Le Nguyen
    PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION APPLICATIONS AND METHODS (ICPRAM), 2021, : 346 - 353
  • [35] A Multi-Modal Hashing Learning Framework for Automatic Image Annotation
    Wang, Jiale
    Li, Guohui
    2017 IEEE SECOND INTERNATIONAL CONFERENCE ON DATA SCIENCE IN CYBERSPACE (DSC), 2017, : 14 - 21
  • [36] Multi-modal deep convolutional dictionary learning for image denoising
    Sun, Zhonggui
    Zhang, Mingzhu
    Sun, Huichao
    Li, Jie
    Liu, Tingting
    Gao, Xinbo
    NEUROCOMPUTING, 2023, 562
  • [37] Learning Nonrigid Deformations for Constrained Multi-modal Image Registration
    Onofrey, John A.
    Staib, Lawrence H.
    Papademetris, Xenophon
    MEDICAL IMAGE COMPUTING AND COMPUTER-ASSISTED INTERVENTION (MICCAI 2013), PT III, 2013, 8151 : 171 - 178
  • [38] Multi-modal self-paced learning for image classification
    Xu, Wei
    Liu, Wei
    Huang, Xiaolin
    Yang, Jie
    Qiu, Song
    NEUROCOMPUTING, 2018, 309 : 134 - 144
  • [39] Learning Confidence Measures by Multi-modal Convolutional Neural Networks
    Fu, Zehua
    Ardabilian Fard, Mohsen
    2018 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2018), 2018, : 1321 - 1330
  • [40] LEARNING OPTIMAL SHAPE REPRESENTATIONS FOR MULTI-MODAL IMAGE REGISTRATION
    Grossiord, Eloise
    Risser, Laurent
    Kanoun, Salim
    Ken, Soleakhena
    Malgouyres, Francois
    2020 IEEE 17TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (ISBI 2020), 2020, : 722 - 725