GLOVE-ING ATTENTION: A MULTI-MODAL NEURAL LEARNING APPROACH TO IMAGE CAPTIONING

Cited by: 0
Authors
Anundskas, Lars Halvor [1]
Afridi, Hina [1, 3]
Tarekegn, Adane Nega [1]
Yamin, Muhammad Mudassar [2]
Ullah, Mohib [1]
Yamin, Saira [2]
Cheikh, Faouzi Alaya [1]
Affiliations
[1] Norwegian Univ Sci & Technol NTNU, Dept Comp Sci, Trondheim, Norway
[2] Dept Management Sci, CUI Wah Campus, Wah Cantt, Pakistan
[3] Geno SA, Storhamargata 44, N-2317 Hamar, Norway
Keywords
Chest X-ray; convolutional neural networks; attention; GloVe embeddings; gated recurrent units
DOI
10.1109/ICASSPW59220.2023.10193011
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Describing images in natural language is a complex task in computer vision. Image captioning generates textual descriptions of images, typically with deep learning frameworks that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, conventional RNNs suffer from exploding and vanishing gradients, which degrades results and yields non-evocative sentences. In this paper, we propose an encoder-decoder deep neural network for image captioning that uses the state-of-the-art backbone architecture EfficientNet as the encoder. The decoder uses multimodal gated recurrent units (GRUs), which incorporate GloVe word embeddings for the text data and visual attention for the image data. The network is trained on three datasets, Indiana Chest X-ray, COCO, and WIT, and evaluated with the standard BLEU and METEOR metrics. The quantitative results show that the network achieves promising results compared with state-of-the-art models. The source code is publicly available at https://bitbucket.org/larswise/imagecaptioning/src/master/wit_pipeline/.
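As a rough illustration of the visual attention step mentioned in the abstract, the following minimal sketch computes soft attention weights over image regions and the resulting context vector that a decoder could consume at each time step. It is not taken from the authors' repository: the function name `soft_attention`, the dot-product scoring, and the toy vectors are all assumptions chosen for brevity, and the paper's model may well score regions differently (for example with a learned additive layer).

```python
import math

def soft_attention(features, hidden):
    """Return (weights, context) for N region vectors and one decoder state.

    features: list of N region feature vectors, each a list of D floats
    hidden:   decoder hidden state, a list of D floats
    Hypothetical helper for illustration only.
    """
    # Relevance score per region; dot product is an assumed scoring function.
    scores = [sum(f_d * h_d for f_d, h_d in zip(f, hidden)) for f in features]
    # Numerically stable softmax turns scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the region features.
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context

# Toy example: three image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = soft_attention(regions, state)
```

In this toy case the first and third regions align equally with the decoder state and receive equal weight, while the second region is down-weighted. In a full captioning model the context vector would typically be combined with the current word embedding before the recurrent update.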
Pages: 5
Related papers
50 in total
  • [21] Multi-modal Sentence Summarization with Modality Attention and Image Filtering
    Li, Haoran
    Zhu, Junnan
    Liu, Tianshang
    Zhang, Jiajun
    Zong, Chengqing
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 4152 - 4158
  • [22] Attention Correctness in Neural Image Captioning
    Liu, Chenxi
    Mao, Junhua
    Sha, Fei
    Yuille, Alan
    THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 4176 - 4182
  • [23] Split Learning of Multi-Modal Medical Image Classification
    Ghosh, Bishwamittra
    Wang, Yuan
    Fu, Huazhu
    Wei, Qingsong
    Liu, Yong
    Goh, Rick Siow Mong
    2024 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI 2024, 2024, : 1326 - 1331
  • [24] A generic neural network for multi-modal sensorimotor learning
    Carenzi, F
    Bendahan, P
    Roschin, VY
    Frolov, AA
    Gorce, P
    Maier, MA
    COMPUTATIONAL NEUROSCIENCE: TRENDS IN RESEARCH 2004, 2004, : 525 - 533
  • [25] A generic neural network for multi-modal sensorimotor learning
    Carenzi, F
    Bendahan, P
    Roschin, VY
    Frolov, AA
    Gorce, P
    Maier, MA
    NEUROCOMPUTING, 2004, 58 : 525 - 533
  • [26] Adversarial Learning With Multi-Modal Attention for Visual Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Cheng, Lei
    Li, Zhoujun
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021, 32 (09) : 3894 - 3908
  • [27] A Multi-task Learning Approach for Image Captioning
    Zhao, Wei
    Wang, Benyou
    Ye, Jianbo
    Yang, Min
    Zhao, Zhou
    Luo, Ruotian
    Qiao, Yu
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 1205 - 1211
  • [28] Multi-modal anchor adaptation learning for multi-modal summarization
    Chen, Zhongfeng
    Lu, Zhenyu
    Rong, Huan
    Zhao, Chuanjun
    Xu, Fan
    NEUROCOMPUTING, 2024, 570
  • [29] Multi-Modal Medical Image Registration with Full or Partial Data: A Manifold Learning Approach
    Bashiri, Fereshteh S.
    Baghaie, Ahmadreza
    Rostami, Reihaneh
    Yu, Zeyun
    D'Souza, Roshan M.
    JOURNAL OF IMAGING, 2019, 5 (01)
  • [30] Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation
    Liu, Chang
    Ding, Henghui
    Zhang, Yulun
    Jiang, Xudong
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 3054 - 3065