Improving Image Captioning through Visual and Semantic Mutual Promotion

Cited by: 0
Authors
Zhang, Jing [1 ]
Xie, Yingshuai [1 ]
Liu, Xiaoqiang [1 ]
Affiliations
[1] East China University of Science and Technology, Shanghai, People's Republic of China
Funding
Natural Science Foundation of Shanghai;
Keywords
Image Captioning; Transformer; Co-attention; Multimodal Fusion;
DOI
10.1145/3581783.3612480
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Current image captioning methods commonly use semantic attributes extracted by an object detector to guide visual representation, leaving the mutual guidance and enhancement between vision and semantics under-explored. Neurological studies have revealed that the visual cortex of the brain plays a crucial role in recognizing visual objects, while the prefrontal cortex is involved in the integration of contextual semantics. Inspired by these studies, we propose a novel Visual-Semantic Transformer (VST) that models the neural interaction between vision and semantics, exploring the mechanism of deep fusion and mutual promotion of multimodal information to realize more accurate image captioning. To better exploit the complementary strengths of visual objects and semantic contexts, we propose a global position-sensitive co-attention encoder that realizes globally associative, position-aware visual-semantic co-interaction through a mutual cross-attention mechanism. In addition, a multimodal mixed attention module in the decoder achieves adaptive multimodal feature fusion, enhancing decoding capability. Experiments show that our VST significantly surpasses state-of-the-art approaches on the MSCOCO dataset, reaching a CIDEr score of 142% on the Karpathy test split.
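For readers who want a concrete picture of the two mechanisms the abstract names, the following is a minimal PyTorch sketch of mutual cross-attention between a visual and a semantic token stream, with a simple learned gate standing in for the decoder's adaptive multimodal fusion. All module names, dimensions, and the gating scheme are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch, assuming standard multi-head attention:
# (1) mutual cross-attention: each modality queries the other,
# (2) gated fusion: a learned per-feature mix of the two streams.
# Names and hyperparameters here are hypothetical, not from the paper.
import torch
import torch.nn as nn

class MutualCoAttention(nn.Module):
    """Queries from one stream attend over keys/values of the other,
    so vision and semantics guide and enhance each other."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.vis_from_sem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_from_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_s = nn.LayerNorm(d_model)

    def forward(self, vis, sem):
        # vis: (B, N_v, d) region/grid features; sem: (B, N_s, d) attribute embeddings
        v_att, _ = self.vis_from_sem(query=vis, key=sem, value=sem)  # vision guided by semantics
        s_att, _ = self.sem_from_vis(query=sem, key=vis, value=vis)  # semantics guided by vision
        return self.norm_v(vis + v_att), self.norm_s(sem + s_att)

class GatedMultimodalFusion(nn.Module):
    """Adaptively mixes the two enhanced streams with a sigmoid gate,
    a common stand-in for a multimodal mixed-attention fusion step."""
    def __init__(self, d_model=512):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, v, s):
        g = torch.sigmoid(self.gate(torch.cat([v, s], dim=-1)))
        return g * v + (1 - g) * s

if __name__ == "__main__":
    B, Nv, Ns, d = 2, 36, 10, 512
    vis, sem = torch.randn(B, Nv, d), torch.randn(B, Ns, d)
    v2, s2 = MutualCoAttention(d)(vis, sem)
    fused = GatedMultimodalFusion(d)(v2.mean(1), s2.mean(1))
    print(fused.shape)  # torch.Size([2, 512])
```

The symmetric pair of attention calls is what distinguishes this co-attention from the usual one-way detector-to-caption guidance the abstract criticizes: each stream's update depends on the other's current representation.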
Pages: 4716-4724
Number of pages: 9
Related Papers
50 records in total
  • [31] Zhang, Zongjian; Wu, Qiang; Wang, Yang; Chen, Fang. Visual Relationship Attention for Image Captioning. 2019 International Joint Conference on Neural Networks (IJCNN), 2019.
  • [32] Nguyen, Thao; Gadre, Samir Yitzhak; Ilharco, Gabriel; Oh, Sewoong; Schmidt, Ludwig. Improving multimodal datasets with image captioning. Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
  • [33] Wei, Haiyang; Li, Zhixin; Huang, Feicheng; Zhang, Canlong; Ma, Huifang; Shi, Zhongzhi. Integrating Scene Semantic Knowledge into Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 2021, 17(2).
  • [34] Li, Jianying; Shao, Xiangjun. A Context Semantic Auxiliary Network for Image Captioning. Information, 2023, 14(7).
  • [35] Sirisha, Uddagiri; Chandana, Bolem Sai. Semantic interdisciplinary evaluation of image captioning models. Cogent Engineering, 2022, 9(1).
  • [36] Jiang, Wenhui; Zhu, Minwei; Fang, Yuming; Shi, Guangming; Zhao, Xiaowei; Liu, Yang. Visual Cluster Grounding for Image Captioning. IEEE Transactions on Image Processing, 2022, 31: 3920-3934.
  • [37] Ami, Amit Saha; Humaira, Mayeesha; Jim, Md Abidur Rahman Khan; Paul, Shimul; Shah, Faisal Muhammad. Bengali Image Captioning with Visual Attention. 2020 23rd International Conference on Computer and Information Technology (ICCIT 2020), 2020.
  • [38] Wang, Yiyu; Xu, Jungang; Sun, Yingfei. A visual persistence model for image captioning. Neurocomputing, 2022, 468: 48-59.
  • [39] Zhang, Jing; Li, Kangkang; Wang, Zhenkun; Zhao, Xianwen; Wang, Zhe. Visual enhanced gLSTM for image captioning. Expert Systems with Applications, 2021, 184.
  • [40] Yao, Ting; Pan, Yingwei; Li, Yehao; Mei, Tao. Exploring Visual Relationship for Image Captioning. Computer Vision - ECCV 2018, Pt XIV, 2018, 11218: 711-727.