Improving Image Captioning through Visual and Semantic Mutual Promotion

Cited by: 0
Authors
Zhang, Jing [1 ]
Xie, Yingshuai [1 ]
Liu, Xiaoqiang [1 ]
Affiliations
[1] East China Univ Sci & Technol, Shanghai, Peoples R China
Funding
Natural Science Foundation of Shanghai;
Keywords
Image Captioning; Transformer; Co-attention; Multimodal Fusion;
DOI
10.1145/3581783.3612480
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Current image captioning methods commonly use semantic attributes extracted by an object detector to guide visual representation, leaving the mutual guidance and enhancement between vision and semantics under-explored. Neurological studies have revealed that the visual cortex of the brain plays a crucial role in recognizing visual objects, while the prefrontal cortex is involved in integrating contextual semantics. Inspired by these studies, we propose a novel Visual-Semantic Transformer (VST) to model the neural interaction between vision and semantics, exploring the mechanism of deep fusion and mutual promotion of multimodal information to realize more accurate image captioning. To better exploit the complementary strengths of visual objects and semantic contexts, we propose a global position-sensitive co-attention encoder that realizes globally associative, position-aware visual and semantic co-interaction through a mutual cross-attention mechanism. In addition, a multimodal mixed attention module is proposed in the decoder, which achieves adaptive multimodal feature fusion to enhance decoding capability. Experimental results show that our VST significantly surpasses state-of-the-art approaches on the MSCOCO dataset, reaching a CIDEr score of 142% on the Karpathy test split.
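The abstract names two core mechanisms: mutual cross-attention between the visual and semantic streams in the encoder, and adaptive multimodal fusion in the decoder. The sketch below is a minimal PyTorch illustration of those two ideas only, not the authors' implementation: the module names (VisualSemanticCoAttention, MixedAttentionFusion), the dimensions, and the sigmoid-gated fusion are assumptions, and the paper's global position-sensitive encoding is omitted.

```python
# Illustrative sketch (assumptions, not the VST code from the paper):
# (1) mutual cross-attention: vision attends to semantics and vice versa;
# (2) gated fusion of the two modality contexts at a decoder step.
import torch
import torch.nn as nn


class VisualSemanticCoAttention(nn.Module):
    """Hypothetical mutual cross-attention between visual and semantic tokens."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.vis_to_sem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_to_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_s = nn.LayerNorm(d_model)

    def forward(self, vis: torch.Tensor, sem: torch.Tensor):
        # Visual tokens query the semantic tokens ...
        v_ctx, _ = self.vis_to_sem(query=vis, key=sem, value=sem)
        # ... and semantic tokens query the visual tokens (mutual promotion).
        s_ctx, _ = self.sem_to_vis(query=sem, key=vis, value=vis)
        return self.norm_v(vis + v_ctx), self.norm_s(sem + s_ctx)


class MixedAttentionFusion(nn.Module):
    """Hypothetical adaptive fusion of visual and semantic decoder contexts."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, v_ctx: torch.Tensor, s_ctx: torch.Tensor):
        # A learned per-token weight decides how much each modality contributes.
        alpha = torch.sigmoid(self.gate(torch.cat([v_ctx, s_ctx], dim=-1)))
        return alpha * v_ctx + (1 - alpha) * s_ctx


# Usage on dummy features: 50 region features and 20 semantic-attribute embeddings.
vis = torch.randn(2, 50, 512)
sem = torch.randn(2, 20, 512)
v_out, s_out = VisualSemanticCoAttention()(vis, sem)
print(v_out.shape, s_out.shape)  # torch.Size([2, 50, 512]) torch.Size([2, 20, 512])
```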
Pages: 4716 - 4724
Number of pages: 9
Related Papers
50 records in total; the first 10 are listed below.
  • [1] Li, Nannan; Chen, Zhenzhong. Image Captioning with Visual-Semantic LSTM. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018: 793-799.
  • [2] Wei, Haiyang; Li, Zhixin; Zhang, Canlong. Image Captioning Based on Visual and Semantic Attention. Multimedia Modeling (MMM 2020), Part I, 2020, 11961: 151-162.
  • [3] Shao, Xiangjun; Dong, Hongsong; Wu, Guangsheng. Improving Visual Question Answering by Image Captioning. IEEE Access, 2025, 13: 46299-46311.
  • [4] He, Chen; Hu, Haifeng. Image Captioning With Visual-Semantic Double Attention. ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15(1).
  • [5] Zhao, Shanshan; Li, Lixiang; Peng, Haipeng. Aligned visual semantic scene graph for image captioning. Displays, 2022, 74.
  • [6] Bai, Cong; Zheng, Anqi; Huang, Yuan; Pan, Xiang; Chen, Nan. Boosting convolutional image captioning with semantic content and visual relationship. Displays, 2021, 70.
  • [7] Guo, Longteng; Liu, Jing; Tang, Jinhui; Li, Jiangwei; Luo, Wei; Lu, Hanqing. Aligning Linguistic Words and Visual Semantic Units for Image Captioning. Proceedings of the 27th ACM International Conference on Multimedia (MM '19), 2019: 765-773.
  • [8] Xiao, Fen; Zhang, Ningru; Xue, Wenfeng; Gao, Xieping. Sentinel mechanism for visual semantic graph-based image captioning. Computers & Electrical Engineering, 2024, 119.
  • [9] Guo, Dandan; Lu, Ruiying; Chen, Bo; Zeng, Zequn; Zhou, Mingyuan. Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning. International Journal of Computer Vision, 2022, 130: 1920-1937.
  • [10] Wu, Chunlei; Wei, Yiwei; Chu, Xiaoliang; Su, Fei; Wang, Leiquan. Modeling visual and word-conditional semantic attention for image captioning. Signal Processing: Image Communication, 2018, 67: 100-107.