Multimodal Context Fusion Based Dense Video Captioning Algorithm

Authors
Li, Meiqi [1]
Zhou, Ziwei [1]
Affiliation
[1] Univ Sci & Technol Liaoning, Sch Comp Sci & Software Engn, Anshan 114051, Peoples R China
Keywords
Dense Video Description; Transformer; Multimodal feature fusion; Event context; SCN Decoder
DOI
Not available
Chinese Library Classification (CLC) number
T [Industrial Technology]
Discipline classification code
08
Abstract
The core task of dense video description is to identify all events occurring in an unedited video and generate textual descriptions for these events. This has applications in fields such as assisting visually impaired individuals, generating news headlines, and enhancing human-computer interaction. However, existing dense video description models often overlook the role of textual information (e.g., road signs, subtitles) in video comprehension, as well as the contextual relationships between events, which are crucial for accurate description generation. To address these issues, this paper proposes a multimodal dense video description approach based on event-context fusion. The model utilizes a C3D network to extract visual features from the video and integrates OCR technology to extract textual information, thereby enhancing the semantic understanding of the video content. During feature extraction, sliding window and temporal alignment techniques are applied to ensure the temporal consistency of visual, audio, and textual features. A multimodal context fusion encoder is used to capture the temporal and semantic relationships between events and to deeply integrate multimodal features. The SCN decoder then generates descriptions word by word, improving both semantic consistency and fluency. The model is trained and evaluated on the MSVD and MSR-VTT datasets, and its performance is compared with several popular models. Experimental results show significant improvements in CIDEr evaluation scores, achieving 98.8 and 53.7 on the two datasets, respectively. Additionally, ablation studies are conducted to comprehensively assess the effectiveness and stability of each component of the model.
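The sliding-window and temporal-alignment step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `sliding_windows` and `align_by_time`, the 16-frame window, the stride of 8, the frame rate, and the sample OCR strings are all assumptions chosen for illustration (16 frames is a clip length commonly used with C3D, but the abstract does not state the value used here). The sketch splits a clip into overlapping frame windows and attaches each timestamped text detection to every window that contains it.

```python
from typing import Any, List, Tuple


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return half-open frame ranges [start, end) that cover the whole clip."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans = []
    start = 0
    while start < num_frames:
        # Clip the last window to the end of the video.
        spans.append((start, min(start + window, num_frames)))
        if start + window >= num_frames:
            break
        start += stride
    return spans


def align_by_time(
    spans: List[Tuple[int, int]],
    fps: float,
    timed_items: List[Tuple[float, Any]],
) -> List[List[Any]]:
    """Group (timestamp_seconds, payload) items by the windows that contain them."""
    aligned: List[List[Any]] = [[] for _ in spans]
    for seconds, payload in timed_items:
        frame = int(seconds * fps)  # map the timestamp onto a frame index
        for i, (start, end) in enumerate(spans):
            if start <= frame < end:
                aligned[i].append(payload)
    return aligned


# Hypothetical inputs: a 40-frame clip at 10 fps with two OCR detections.
spans = sliding_windows(num_frames=40, window=16, stride=8)
ocr_hits = [(0.5, "STOP"), (1.5, "Main St")]
aligned = align_by_time(spans, fps=10.0, timed_items=ocr_hits)
```

Because the windows overlap, a detection near a window boundary is shared by both neighbouring windows, which is one simple way to keep visual and textual features on a common timeline.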
Pages: 1061-1072
Page count: 12