Improving Image Captioning through Visual and Semantic Mutual Promotion

Cited by: 0
Authors
Zhang, Jing [1 ]
Xie, Yingshuai [1 ]
Liu, Xiaoqiang [1 ]
Affiliations
[1] East China Univ Sci & Technol, Shanghai, Peoples R China
Funding
Natural Science Foundation of Shanghai;
Keywords
Image Captioning; Transformer; Co-attention; Multimodal Fusion;
DOI
10.1145/3581783.3612480
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Current image captioning methods commonly use semantic attributes extracted by an object detector to guide the visual representation, leaving the mutual guidance and enhancement between vision and semantics under-explored. Neurological studies have revealed that the visual cortex of the brain plays a crucial role in recognizing visual objects, while the prefrontal cortex is involved in integrating contextual semantics. Inspired by these findings, we propose a novel Visual-Semantic Transformer (VST) to model the neural interaction between vision and semantics, exploring the mechanism of deep fusion and mutual promotion of multimodal information to achieve more accurate image captioning. To better exploit the complementary strengths of visual objects and semantic contexts, we propose a global position-sensitive co-attention encoder that realizes globally associative, position-aware visual-semantic co-interaction through a mutual cross-attention mechanism. In addition, a multimodal mixed attention module in the decoder performs adaptive multimodal feature fusion to enhance decoding capability. Experiments show that our VST significantly surpasses state-of-the-art approaches on the MSCOCO dataset, reaching a CIDEr score of 142% on the Karpathy test split.
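To make the two components described in the abstract concrete, the sketch below gives a minimal, hypothetical PyTorch rendering of the idea: a bidirectional (mutual) cross-attention step in which visual region features and semantic attribute embeddings attend to each other, followed by a gated fusion of the two streams. This is an illustrative reconstruction based only on the abstract, not the authors' released code; the class names MutualCoAttention and GatedMultimodalFusion, the feature dimensions, and the pooling/fusion choices are all assumptions.

```python
# Hypothetical sketch of visual-semantic mutual cross-attention and gated
# multimodal fusion, inspired by the VST abstract. Not the authors' code;
# all names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class MutualCoAttention(nn.Module):
    """Bidirectional cross-attention: vision attends to semantics and vice versa."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.v2s = nn.MultiheadAttention(dim, heads, batch_first=True)  # vision queries semantics
        self.s2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # semantics queries vision
        self.norm_v = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, vis, sem):
        # vis: (B, Nv, dim) region features; sem: (B, Ns, dim) attribute embeddings
        vis_ctx, _ = self.v2s(vis, sem, sem)  # visual features enriched by semantic context
        sem_ctx, _ = self.s2v(sem, vis, vis)  # semantic features grounded in visual evidence
        return self.norm_v(vis + vis_ctx), self.norm_s(sem + sem_ctx)


class GatedMultimodalFusion(nn.Module):
    """One plausible form of adaptive fusion of the two streams for the decoder."""

    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, vis_pooled, sem_pooled):
        # Per-dimension sigmoid gate weighting visual vs. semantic evidence.
        g = torch.sigmoid(self.gate(torch.cat([vis_pooled, sem_pooled], dim=-1)))
        return g * vis_pooled + (1 - g) * sem_pooled


if __name__ == "__main__":
    vis = torch.randn(2, 36, 512)  # e.g. 36 detected regions per image
    sem = torch.randn(2, 20, 512)  # e.g. 20 semantic attribute tokens
    co_attn = MutualCoAttention()
    fuse = GatedMultimodalFusion()
    v, s = co_attn(vis, sem)
    fused = fuse(v.mean(dim=1), s.mean(dim=1))
    print(fused.shape)  # torch.Size([2, 512])
```

The sigmoid gate lets the decoder weight visual against semantic evidence per feature dimension, which is one way the "adaptive multimodal feature fusion" mentioned in the abstract could be realized; the paper's actual module may differ.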
Pages: 4716-4724
Number of pages: 9