Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning

Cited by: 2
Authors
Dong, Shanshan [1 ]
Niu, Tianzi [1 ]
Luo, Xin [1 ]
Liu, Wu [2 ]
Xu, Xinshun [1 ]
Affiliations
[1] Shandong Univ, Sch Software, Jinan 250101, Peoples R China
[2] JD AI Res, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video captioning; semantic embedding guided attention; explicit visual feature fusion;
DOI
10.1145/3550276
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Video captioning, which bridges vision and language, is a fundamental yet challenging task in computer vision. Generating accurate and comprehensive sentences requires both visual and semantic information. However, most existing methods simply concatenate different types of features and ignore the interactions between them. In addition, a large semantic gap between the visual feature space and the semantic embedding space makes the task even harder. To address these issues, we propose a framework named semantic embedding guided attention with Explicit visual Feature Fusion for vidEo CapTioning, EFFECT for short, in which we design an explicit visual-feature fusion (EVF) scheme that captures the pairwise interactions between multiple visual modalities and fuses the multimodal visual features of videos in an explicit way. Furthermore, we propose a novel attention mechanism called semantic embedding guided attention (SEGA), which cooperates with temporal attention to generate a joint attention map. Specifically, in SEGA, semantic word-embedding information guides the model to attend to the most correlated visual features at each decoding step, alleviating the semantic gap between the visual and semantic spaces to some extent. To evaluate the proposed model, we conduct extensive experiments on two widely used datasets, i.e., MSVD and MSR-VTT. The experimental results demonstrate that our approach achieves state-of-the-art results on four evaluation metrics.
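The joint attention the abstract describes — temporal attention from the decoder state, combined with attention guided by the embedding of the previously generated word — can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the projection matrices `W_t` and `W_s`, the element-wise product used to combine the two attention maps, and all tensor shapes are hypothetical choices made for the sketch.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def joint_attention(frame_feats, hidden, word_emb, W_t, W_s):
    """Sketch of a SEGA-style joint attention step.

    frame_feats: (T, d) visual features for T frames
    hidden:      (h,)   decoder hidden state (drives temporal attention)
    word_emb:    (e,)   embedding of the previously generated word
    W_t:         (d, h) projection for temporal attention (assumed)
    W_s:         (d, e) projection for semantic guidance (assumed)
    """
    # Temporal attention: score each frame against the decoder state.
    temporal = softmax(frame_feats @ W_t @ hidden)          # (T,)
    # Semantic embedding guided attention: score frames against the word embedding.
    semantic = softmax(frame_feats @ W_s @ word_emb)        # (T,)
    # Combine into a joint attention map (element-wise product, renormalized —
    # one plausible combination; the paper's exact fusion may differ).
    joint = temporal * semantic
    joint /= joint.sum()
    # Attended visual context vector fed to the decoder at this step.
    context = joint @ frame_feats                           # (d,)
    return context, joint
```

At each decoding step the word embedding re-weights the frames, so visual features most correlated with the emerging sentence dominate the context vector.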
Pages: 18
Related Papers (50 total)
  • [41] Parallel-fusion LSTM with synchronous semantic and visual information for image captioning
    Zhang, Jing
    Li, Kangkang
    Wang, Zhe
    [J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2021, 75 (75)
  • [42] Video semantic segmentation via feature propagation with holistic attention
    Wu, Junrong
    Wen, Zongzheng
    Zhao, Sanyuan
    Huang, Kele
    [J]. PATTERN RECOGNITION, 2020, 104
  • [43] Visual attention guided bit allocation in video compression
    Li, Zhicheng
    Qin, Shiyin
    Itti, Laurent
    [J]. IMAGE AND VISION COMPUTING, 2011, 29 (01) : 1 - 14
  • [44] Local feature-based video captioning with multiple classifier and CARU-attention
    Im, Sio-Kei
    Chan, Ka-Hou
    [J]. IET IMAGE PROCESSING, 2024, 18 (09) : 2304 - 2317
  • [45] Feature Fusion Network Based on Hybrid Attention for Semantic Segmentation
    Xie, Xinchen
    Li, Chen
    Tian, Lihua
    [J]. 2022 IEEE WORLD AI IOT CONGRESS (AIIOT), 2022, : 9 - 14
  • [46] Self-attention feature fusion network for semantic segmentation
    Zhou, Zhen
    Zhou, Yan
    Wang, Dongli
    Mu, Jinzhen
    Zhou, Haibin
    [J]. NEUROCOMPUTING, 2021, 453 : 50 - 59
  • [47] Lightweight Semantic Segmentation Network based on Attention Feature Fusion
    Kuang, Xianyan
    Liu, Ping
    Chen, Yixi
    Zhang, Jianhua
    [J]. ENGINEERING LETTERS, 2023, 31 (04) : 1584 - 1591
  • [48] Semantic Image Segmentation with Improved Position Attention and Feature Fusion
    Zhu, Hegui
    Miao, Yan
    Zhang, Xiangde
    [J]. NEURAL PROCESSING LETTERS, 2020, 52 (01) : 329 - 351
  • [50] Visual attention guided image fusion with sparse representation
    Yang, Bin
    Li, Shutao
    [J]. OPTIK, 2014, 125 (17): : 4881 - 4888