Regular Constrained Multimodal Fusion for Image Captioning

Cited by: 0
Authors
Wang, Liya [1 ]
Chen, Haipeng [2 ]
Liu, Yu [2 ]
Lyu, Yingda [3 ]
Affiliations
[1] Jilin University, College of Software, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Changchun,130012, China
[2] Jilin University, College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Changchun,130012, China
[3] Jilin University, Center for Public Education Research, Changchun,130012, China
DOI
10.1109/TCSVT.2024.3425513
Abstract
Generating captions that are more diverse and closer to human descriptions is of paramount importance in image captioning. Recent research has achieved significant advances, mostly with end-to-end encoder-decoder architectures that integrate specific feature-text processing. However, the homogeneity of these model structures, the over-simplicity or over-complexity of their feature-text fusion, and the uniformity of their training objectives all limit the diversity and effectiveness of the generated captions, and thus the potential applications of the task. In this paper, we therefore propose the Regular Constrained Multimodal Fusion (RCMF) method for image captioning, which better integrates information across and within modalities while approaching human-like fine-grained semantic perception and relationship reasoning. RCMF first preprocesses images with a Swin Transformer, then applies an extended encoder with a new intra-modal fusion module that uses window-focused linear attention to capture features and to leverage refined grid and global visual features. Combining these with text features, RCMF employs a cross-modal fusion module and a decoder to deeply model the interaction between text and image. In addition, RCMF introduces a new auxiliary regulatory modal fusion reasoning (MFR) branch on top of this architecture. Its MFR loss, combined with cross-entropy loss, forms a new training objective that effectively mines fine-grained relationships between images and text and perceives the semantic information of images and their corresponding captions, thereby regulating the generated captions to be more diverse and human-like. Experimental results on the MS COCO 2014 dataset, particularly under identical experimental conditions, demonstrate the strong performance of our method, especially on the METEOR, ROUGE-L, CIDEr, and SPICE metrics. Visualization results further confirm the effectiveness of RCMF. Source code is available at https://github.com/200084/RCMF-for-image-caption. © 2024 IEEE.
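The abstract describes a training objective that combines an MFR loss with standard token-level cross-entropy. As an illustration only (this record does not give the MFR loss's actual form), here is a minimal NumPy sketch of such a weighted combination, with the MFR term standing in as a placeholder scalar; the function and parameter names are hypothetical, not from the paper:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def combined_caption_loss(logits, targets, mfr_term, mfr_weight=0.5):
    """Token-level cross-entropy plus a weighted auxiliary term.

    logits:   (batch, seq_len, vocab) caption decoder scores
    targets:  (batch, seq_len) gold token ids
    mfr_term: placeholder scalar standing in for the MFR loss,
              whose actual definition is given in the paper
    """
    probs = softmax(logits)
    b, t = targets.shape
    # probability the decoder assigns to each gold token
    gold = probs[np.arange(b)[:, None], np.arange(t)[None, :], targets]
    ce = -np.log(gold + 1e-12).mean()
    return ce + mfr_weight * mfr_term

# toy usage with random scores
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5, 100))    # batch=2, seq=5, vocab=100
targets = rng.integers(0, 100, size=(2, 5))
loss = combined_caption_loss(logits, targets, mfr_term=0.3)
```

In practice both terms would be differentiable tensors so gradients flow through the shared encoder; the scalar stand-in here only shows how the two objectives are mixed into one training signal.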
Pages: 11900-11913