Center-enhanced video captioning model with multimodal semantic alignment

Cited by: 0

Authors:

Zhang, Benhui [1,2]

Gao, Junyu [2,3]

Yuan, Yuan [2]

Affiliations:

[1] School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China

[2] School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, 710072, China

[3] Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China

Source:

Neural Networks | 2024, Vol. 180

Funding:

National Natural Science Foundation of China

Keywords:

Video analysis

DOI:

10.1016/j.neunet.2024.106744

Abstract:

Video captioning aims to automatically generate descriptive sentences for a given video, establishing an association between visual content and textual language. It has attracted great attention and plays a significant role in many practical applications. Previous research focuses mainly on caption generation and ignores the alignment of multimodal features, simply concatenating them. Moreover, video feature extraction is usually done offline, so the extracted features may not be adapted to the subsequent caption generation task. To make the extracted features better suited to downstream caption generation and to address multimodal semantic alignment and fusion, we propose an end-to-end center-enhanced video captioning model with multimodal semantic alignment, which integrates feature extraction and caption generation into a unified framework. To enhance the completeness of semantic features, we design a center enhancement strategy in which the visual–textual deep joint semantic features are captured via incremental clustering, and the cluster centers then serve as guidance for better caption generation. Furthermore, we promote visual–textual multimodal alignment and fusion by learning the visual and textual representations in a shared latent semantic space, so as to alleviate the multimodal misalignment problem. Experimental results on two popular datasets, MSVD and MSR-VTT, demonstrate that the proposed model outperforms state-of-the-art methods and produces higher-quality captions. © 2024 Elsevier Ltd
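
The abstract mentions capturing joint semantic features via incremental clustering and using the cluster centers as guidance. The record does not give the algorithm, so the following is only a generic illustration of one common form of incremental clustering (nearest-center assignment with a running-mean update), not the authors' method. The class name `IncrementalCenters`, the cosine-similarity criterion, and the `threshold` parameter are all assumptions introduced for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class IncrementalCenters:
    """Hypothetical bank of cluster centers updated one feature at a time."""

    def __init__(self, threshold=0.8):
        # Assumed hyperparameter: minimum similarity to join an existing cluster.
        self.threshold = threshold
        self.centers = []  # one running-mean vector per cluster
        self.counts = []   # number of features absorbed by each cluster

    def update(self, feature):
        """Assign `feature` to the most similar center, or open a new cluster.

        Returns the index of the cluster the feature was assigned to.
        """
        if self.centers:
            sims = [cosine(feature, c) for c in self.centers]
            best = max(range(len(sims)), key=sims.__getitem__)
            if sims[best] >= self.threshold:
                self.counts[best] += 1
                n = self.counts[best]
                # Running mean: c_new = c_old + (x - c_old) / n
                self.centers[best] = [
                    c + (x - c) / n for c, x in zip(self.centers[best], feature)
                ]
                return best
        # No sufficiently similar center exists: start a new cluster.
        self.centers.append(list(feature))
        self.counts.append(1)
        return len(self.centers) - 1
```

In a captioning pipeline, the vectors passed to `update` would be joint visual-textual embeddings, and the resulting `centers` could be supplied to the decoder as extra context. How the paper actually builds and uses its centers is not stated in this record.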
