Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning

Cited by: 2
Authors
Dong, Shanshan [1 ]
Niu, Tianzi [1 ]
Luo, Xin [1 ]
Liu, Wu [2 ]
Xu, Xinshun [1 ]
Affiliations
[1] Shandong Univ, Sch Software, Jinan 250101, Peoples R China
[2] JD AI Res, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video captioning; semantic embedding guided attention; explicit visual feature fusion;
DOI
10.1145/3550276
Chinese Library Classification (CLC)
TP [Automation technology, computer technology];
Discipline code
0812;
Abstract
Video captioning, which bridges vision and language, is a fundamental yet challenging task in computer vision. To generate accurate and comprehensive sentences, both visual and semantic information is quite important. However, most existing methods simply concatenate different types of features and ignore the interactions between them. In addition, there is a large semantic gap between visual feature space and semantic embedding space, making the task very challenging. To address these issues, we propose a framework named semantic embedding guided attention with Explicit visual Feature Fusion for vidEo CapTioning, EFFECT for short, in which we design an explicit visual-feature fusion (EVF) scheme to capture the pairwise interactions between multiple visual modalities and fuse multimodal visual features of videos in an explicit way. Furthermore, we propose a novel attention mechanism called semantic embedding guided attention (SEGA), which cooperates with the temporal attention to generate a joint attention map. Specifically, in SEGA, the semantic word embedding information is leveraged to guide the model to pay more attention to the most correlated visual features at each decoding stage. In this way, the semantic gap between visual and semantic space is alleviated to some extent. To evaluate the proposed model, we conduct extensive experiments on two widely used datasets, i.e., MSVD and MSR-VTT. The experimental results demonstrate that our approach achieves state-of-the-art results in terms of four evaluation metrics.
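The two components described in the abstract can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the plain dot-product scoring, the elementwise-product fusion rules, and all function names below are simplified assumptions standing in for the learned projections and the exact EVF/SEGA formulations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def evf_fuse(appearance, motion):
    """Explicit visual feature fusion (simplified sketch): keep both
    modality features plus an explicit pairwise interaction term
    (here an elementwise product), standing in for the paper's
    learned EVF scheme over multiple visual modalities."""
    interaction = [a * m for a, m in zip(appearance, motion)]
    return appearance + motion + interaction


def sega_attend(frame_feats, hidden, word_emb):
    """Joint attention map over frame features at one decoding step.

    Temporal scores come from the decoder hidden state; semantic
    scores from the previous word's embedding. The two maps are
    multiplied and renormalized into a joint map (a hypothetical
    fusion rule; the paper uses learned components). For simplicity,
    frames, hidden state, and word embedding share one dimension.
    """
    temporal = softmax([dot(f, hidden) for f in frame_feats])
    semantic = softmax([dot(f, word_emb) for f in frame_feats])
    joint = [t * s for t, s in zip(temporal, semantic)]
    norm = sum(joint)
    joint = [j / norm for j in joint]
    # Attended context vector: weighted sum of frame features.
    dim = len(frame_feats[0])
    context = [sum(w * f[d] for w, f in zip(joint, frame_feats))
               for d in range(dim)]
    return joint, context
```

In this toy form, a word embedding aligned with a particular frame's features sharpens the temporal attention toward that frame, which is the intuition behind letting semantic embeddings guide where the decoder looks.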
Pages: 18