Multimodal Context Fusion Based Dense Video Captioning Algorithm

Cited by: 0
Authors
Li, Meiqi [1 ]
Zhou, Ziwei [1 ]
Affiliations
[1] Univ Sci & Technol Liaoning, Sch Comp Sci & Software Engn, Anshan 114051, Peoples R China
Keywords
Dense Video Description; Transformer; Multimodal feature fusion; Event context; SCN Decoder
DOI
Not available
Chinese Library Classification
T [Industrial Technology]
Subject Classification Code
08
Abstract
The core task of dense video captioning is to identify all events occurring in an unedited video and to generate a textual description for each of them. The task has applications in areas such as assisting visually impaired individuals, generating news headlines, and enhancing human-computer interaction. However, existing dense video captioning models often overlook the role of textual information in the video (e.g., road signs, subtitles), as well as the contextual relationships between events, both of which are crucial for accurate description generation. To address these issues, this paper proposes a multimodal dense video captioning approach based on event-context fusion. The model uses a C3D network to extract visual features from the video and integrates OCR to extract textual information, thereby enriching the semantic understanding of the video content. During feature extraction, sliding-window and temporal-alignment techniques are applied to ensure the temporal consistency of the visual, audio, and textual features. A multimodal context-fusion encoder captures the temporal and semantic relationships between events and deeply integrates the multimodal features, and an SCN decoder then generates descriptions word by word, improving both semantic consistency and fluency. The model is trained and evaluated on the MSVD and MSR-VTT datasets and compared with several popular models. Experimental results show significant improvements in CIDEr scores, reaching 98.8 and 53.7 on the two datasets, respectively. Ablation studies further assess the effectiveness and stability of each component of the model.
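The abstract mentions aligning visual, audio, and textual (OCR) features onto a common timeline before fusion. The paper does not give implementation details, so the following is only a minimal illustrative sketch of that idea: each modality's features, sampled at its own rate, are mean-pooled into a shared sequence of fixed temporal windows and then concatenated per window. All function names and parameters here are hypothetical, not from the paper.

```python
# Illustrative sketch (not the paper's code): pool per-modality features
# into a shared set of temporal windows, then concatenate per window.

def align_to_windows(features, timestamps, duration, num_windows):
    """Mean-pool (vector, timestamp) pairs into num_windows equal bins."""
    dim = len(features[0])
    win_len = duration / num_windows
    sums = [[0.0] * dim for _ in range(num_windows)]
    counts = [0] * num_windows
    for vec, t in zip(features, timestamps):
        idx = min(int(t / win_len), num_windows - 1)  # clamp t == duration
        counts[idx] += 1
        for d in range(dim):
            sums[idx][d] += vec[d]
    # Windows with no samples stay as zero vectors.
    return [[s / c for s in row] if c else row
            for row, c in zip(sums, counts)]

def fuse(*aligned_streams):
    """Concatenate the aligned per-window features of each modality."""
    return [sum(window, []) for window in zip(*aligned_streams)]

# Toy usage: 1-D "visual" and "audio" features over a 1-second clip,
# aligned to two windows and fused.
visual = align_to_windows([[1.0], [3.0]], [0.2, 0.8], 1.0, 2)
audio = align_to_windows([[2.0], [4.0], [6.0]], [0.1, 0.4, 0.9], 1.0, 2)
fused = fuse(visual, audio)
```

In the actual model the pooled windows would feed the multimodal context-fusion encoder; this sketch only shows the windowing and concatenation step.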
Pages: 1061-1072 (12 pages)
Related Papers (50 total)
  • [31] Event-centric multi-modal fusion method for dense video captioning
    Chang, Zhi
    Zhao, Dexin
    Chen, Huilin
    Li, Jingdan
    Liu, Pengfei
    NEURAL NETWORKS, 2022, 146 : 120 - 129
  • [32] Dense Video Captioning Using Graph-Based Sentence Summarization
    Zhang, Zhiwang
    Xu, Dong
    Ouyang, Wanli
    Zhou, Luping
    IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 1799 - 1810
  • [33] Hierarchical Language Modeling for Dense Video Captioning
    Dave, Jaivik
    Padmavathi, S.
    INVENTIVE COMPUTATION AND INFORMATION TECHNOLOGIES, ICICIT 2021, 2022, 336 : 421 - 431
  • [34] Accelerated masked transformer for dense video captioning
    Yu, Zhou
    Han, Nanjia
    NEUROCOMPUTING, 2021, 445 : 72 - 80
  • [35] Multi-modal Dense Video Captioning
    Iashin, Vladimir
    Rahtu, Esa
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 4117 - 4126
  • [36] TopicDVC: Dense Video Captioning with Topic Guidance
    Chen, Wei
    2024 IEEE 10TH INTERNATIONAL CONFERENCE ON EDGE COMPUTING AND SCALABLE CLOUD, EDGECOM 2024, 2024, : 82 - 87
  • [37] Video Captioning with Guidance of Multimodal Latent Topics
    Chen, Shizhe
    Chen, Jia
    Jin, Qin
    Hauptmann, Alexander
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1838 - 1846
  • [38] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
  • [39] Video Captioning based on Multi-feature Fusion with Object
    Zhou, Lijuan
    Liu, Tao
    Niu, Changyong
    THIRTEENTH INTERNATIONAL CONFERENCE ON DIGITAL IMAGE PROCESSING (ICDIP 2021), 2021, 11878
  • [40] Regular Constrained Multimodal Fusion for Image Captioning
    Wang, Liya
    Chen, Haipeng
    Liu, Yu
    Lyu, Yingda
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 11900 - 11913