Center-enhanced video captioning model with multimodal semantic alignment

Cited: 0
Authors
Zhang, Benhui [1 ,2 ]
Gao, Junyu [2 ,3 ]
Yuan, Yuan [2 ]
Affiliations
[1] School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China
[2] School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an 710072, China
[3] Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
Funding
National Natural Science Foundation of China
Keywords
Video analysis
DOI
10.1016/j.neunet.2024.106744
Abstract
Video captioning aims to automatically generate descriptive sentences for a given video, establishing an association between visual content and textual language; it has attracted great attention and plays a significant role in many practical applications. Previous research focuses more on caption generation, ignoring the alignment of multimodal features and simply concatenating them. Moreover, video feature extraction is usually performed offline, so the extracted features may not be well adapted to the subsequent caption generation task. To improve the applicability of the extracted features for downstream caption generation and to address multimodal semantic alignment and fusion, we propose an end-to-end center-enhanced video captioning model with multimodal semantic alignment, which integrates feature extraction and caption generation into a unified framework. To enhance the completeness of semantic features, we design a center enhancement strategy in which deep joint visual–textual semantic features are captured via incremental clustering; the cluster centers then serve as guidance for better caption generation. Furthermore, we promote visual–textual multimodal alignment and fusion by learning visual and textual representations in a shared latent semantic space, thereby alleviating the multimodal misalignment problem. Experimental results on two popular datasets, MSVD and MSR-VTT, demonstrate that the proposed model outperforms state-of-the-art methods, producing higher-quality captions. © 2024 Elsevier Ltd
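The abstract's center enhancement strategy captures joint visual–textual features via incremental clustering and uses the resulting centers as guidance. The sketch below is only a generic illustration of that idea, not the paper's actual algorithm: it implements an online k-means-style update in NumPy, where each incoming joint feature pulls its nearest center toward it. The function name `incremental_centers` and the parameters `num_centers` and `lr` are assumptions made for illustration.

```python
import numpy as np

def incremental_centers(features, num_centers=3, lr=0.1, seed=0):
    """Incrementally cluster joint feature vectors (online k-means style).

    Each sample is assigned to its nearest center, and that center is
    moved a small step toward the sample, so centers adapt as features
    arrive one at a time.
    """
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    # random initial centers (illustrative; real systems often seed
    # from early samples instead)
    centers = rng.normal(size=(num_centers, dim))
    for x in features:
        # index of the nearest center under Euclidean distance
        k = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
        # move the winning center toward the sample
        centers[k] += lr * (x - centers[k])
    return centers
```

In a captioning pipeline like the one described, such centers would summarize the shared visual–textual semantic space and could be fed to the decoder as additional guidance; here they are simply returned for inspection.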