GLOVE-ING ATTENTION: A MULTI-MODAL NEURAL LEARNING APPROACH TO IMAGE CAPTIONING

Cited by: 0
Authors
Anundskas, Lars Halvor [1 ]
Afridi, Hina [1 ,3 ]
Tarekegn, Adane Nega [1 ]
Yamin, Muhammad Mudassar [2 ]
Ullah, Mohib [1 ]
Yamin, Saira [2 ]
Cheikh, Faouzi Alaya [1 ]
Affiliations
[1] Norwegian Univ Sci & Technol NTNU, Dept Comp Sci, Trondheim, Norway
[2] Dept Management Sci, CUI Wah Campus, Wah Cantt, Pakistan
[3] Geno SA, Storhamargata 44, N-2317 Hamar, Norway
Keywords
Chest X-ray; convolutional neural networks; attention; GloVe embeddings; gated recurrent units
DOI
10.1109/ICASSPW59220.2023.10193011
CLC Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Describing images in natural language is a complex task in computer vision. Image caption generation is typically achieved with learning frameworks that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, conventional RNNs suffer from exploding and vanishing gradients, which leads to inferior, non-evocative captions. In this paper, we propose an encoder-decoder deep neural network that generates image captions using the state-of-the-art EfficientNet backbone as the encoder. The decoder is a multimodal gated recurrent unit (GRU) that incorporates GloVe word embeddings for the text data and visual attention for the image data. The network is trained on three datasets: Indiana Chest X-ray, COCO, and WIT, and the results are evaluated with the standard BLEU and METEOR metrics. The quantitative results show that the network achieves promising performance compared to state-of-the-art models. The source code is publicly available at https://bitbucket.org/larswise/imagecaptioning/src/master/wit_pipeline/.
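
The abstract describes an encoder-decoder pipeline in which EfficientNet extracts a grid of visual features and a GRU decoder attends over that grid while consuming GloVe word embeddings. The following is a minimal PyTorch sketch of that wiring, not the authors' implementation (which is in the Bitbucket repository above); the Bahdanau-style additive attention, the EfficientNet-B0 variant, the feature and hidden dimensions, and all class names are assumptions made only for illustration.

# Minimal sketch (not the authors' code): EfficientNet encoder + GRU decoder
# with GloVe-initialised embeddings and Bahdanau-style visual attention.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0


class Encoder(nn.Module):
    """EfficientNet-B0 backbone; returns a grid of spatial features (B, 49, 1280)."""
    def __init__(self):
        super().__init__()
        self.backbone = efficientnet_b0(weights=None).features   # (B, 1280, 7, 7) for 224x224 input

    def forward(self, images):
        feats = self.backbone(images)                             # (B, 1280, 7, 7)
        return feats.flatten(2).permute(0, 2, 1)                  # (B, 49, 1280)


class BahdanauAttention(nn.Module):
    """Additive attention over the 49 spatial locations, conditioned on the GRU state."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hid = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        scores = self.v(torch.tanh(self.w_feat(feats) + self.w_hid(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)                      # (B, 49, 1) attention weights
        context = (alpha * feats).sum(dim=1)                      # (B, feat_dim) attended visual context
        return context, alpha


class Decoder(nn.Module):
    """Single-layer GRU that consumes [GloVe embedding; attended context] at each step."""
    def __init__(self, glove_weights, feat_dim=1280, hidden_dim=512):
        super().__init__()
        vocab_size, embed_dim = glove_weights.shape
        # Initialise the embedding table from pretrained GloVe vectors.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.attention = BahdanauAttention(feat_dim, hidden_dim)
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        batch, steps = captions.shape
        hidden = feats.new_zeros(batch, self.gru.hidden_size)
        logits = []
        for t in range(steps):
            context, _ = self.attention(feats, hidden)
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            hidden = self.gru(x, hidden)
            logits.append(self.fc(hidden))
        return torch.stack(logits, dim=1)                         # (B, T, vocab)


if __name__ == "__main__":
    glove = torch.randn(5000, 300)        # placeholder; real GloVe vectors would be loaded here
    encoder, decoder = Encoder(), Decoder(glove)
    images = torch.randn(2, 3, 224, 224)
    captions = torch.randint(0, 5000, (2, 12))
    out = decoder(encoder(images), captions)
    print(out.shape)                      # torch.Size([2, 12, 5000])

Concatenating the attended visual context with the word embedding at every GRU step is one common way to realise the "multimodal" decoder the abstract mentions; the paper's actual fusion strategy should be taken from the linked repository.
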
Pages: 5
Related Papers
50 records in total
  • [1] Multi-Modal Image Captioning for the Visually Impaired
    Ahsan, Hiba
    Bhalla, Nikita
    Bhatt, Daivat
    Shah, Kaivankumar
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 53 - 60
  • [2] Multi-Modal Graph Aggregation Transformer for image captioning
    Chen, Lizhi
    Li, Kesen
    NEURAL NETWORKS, 2025, 181
  • [3] Contextualized Keyword Representations for Multi-modal Retinal Image Captioning
    Huang, Jia-Hong
    Wu, Ting-Wei
    Worring, Marcel
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 645 - 652
  • [4] AMC: Attention guided Multi-modal Correlation Learning for Image Search
    Chen, Kan
    Bui, Trung
    Fang, Chen
    Wang, Zhaowen
    Nevatia, Ram
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 6203 - 6211
  • [5] MULTI-MODAL HIERARCHICAL ATTENTION-BASED DENSE VIDEO CAPTIONING
    Munusamy, Hemalatha
    Sekhar, Chandra C.
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 475 - 479
  • [6] Cross-modal attention for multi-modal image registration
    Song, Xinrui
    Chao, Hanqing
    Xu, Xuanang
    Guo, Hengtao
    Xu, Sheng
    Turkbey, Baris
    Wood, Bradford J.
    Sanford, Thomas
    Wang, Ge
    Yan, Pingkun
    MEDICAL IMAGE ANALYSIS, 2022, 82
  • [7] Multi-modal Dense Video Captioning
    Iashin, Vladimir
    Rahtu, Esa
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 4117 - 4126
  • [8] Attention driven multi-modal similarity learning
    Gao, Xinjian
    Mu, Tingting
    Goulermas, John Y.
    Wang, Meng
    INFORMATION SCIENCES, 2018, 432 : 530 - 542
  • [9] Multi-Modal Medical Image Fusion Using Transfer Learning Approach
    Kalamkar, Shrida
    Mary, Geetha A.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (12) : 483 - 488
  • [10] Towards Video Captioning with Naming: A Novel Dataset and a Multi-modal Approach
    Pini, Stefano
    Cornia, Marcella
    Baraldi, Lorenzo
    Cucchiara, Rita
    IMAGE ANALYSIS AND PROCESSING (ICIAP 2017), PT II, 2017, 10485 : 384 - 395