MRCap: Multi-modal and Multi-level Relationship-based Dense Video Captioning

Cited by: 0
Authors
Chen, Wei [1]
Niu, Jianwei [1,2,3]
Liu, Xuefeng [1,2]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing, Peoples R China
[2] Zhongguancun Lab, Beijing, Peoples R China
[3] Zhengzhou Univ, Res Inst Ind Technol, Sch Informat Engn, Zhengzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Dense video captioning; event; multi-modal and multi-level; relationship;
DOI
10.1109/ICME55011.2023.00445
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Dense video captioning, with the objective of describing a sequence of events in a video, has received much attention recently. As events in a video are highly correlated, leveraging relationships among events helps generate coherent captions. To utilize relationships among events, existing methods mainly enrich event representations with their context, either in the form of vision (i.e., video segments) or a combination of vision and language (i.e., captions). However, these methods do not explicitly exploit the correspondence between the two modalities. Moreover, the video-level context spanning multiple events is not fully exploited. In this paper, we propose MRCap, a novel relationship-based model for dense video captioning. The key component of MRCap is a multi-modal and multi-level event relationship module (MMERM). MMERM exploits the correspondence between vision and language at both the event level and the video level via contrastive learning. Experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that MRCap achieves state-of-the-art performance.
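This record gives no implementation details beyond the abstract, so the following is only a minimal sketch of the kind of contrastive alignment it describes: an InfoNCE-style objective that pulls each event's visual embedding toward its paired caption embedding (the same loss could be applied to pooled, video-level features). The function and argument names (contrastive_alignment_loss, vis_feats, txt_feats) and the temperature value are illustrative assumptions, not MRCap's actual interface.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(vis_feats, txt_feats, temperature=0.07):
    """InfoNCE-style loss aligning paired visual and textual embeddings.

    vis_feats, txt_feats: (N, D) tensors; row i of each corresponds to the
    same event (or the same video, for video-level alignment).
    All names and the temperature value are illustrative assumptions.
    """
    vis = F.normalize(vis_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = vis @ txt.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(vis.size(0), device=vis.device)
    # Symmetric cross-entropy: match vision->text and text->vision,
    # treating all other pairs in the batch as negatives.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```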
Pages: 2615-2620 (6 pages)
Related papers
50 records in total
  • [1] Multi-modal Dense Video Captioning
    Iashin, Vladimir
    Rahtu, Esa
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 4117 - 4126
  • [2] Multi-Modal Hierarchical Attention-Based Dense Video Captioning
    Munusamy, Hemalatha
    Sekhar, Chandra C.
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 475 - 479
  • [3] Multi-modal Dependency Tree for Video Captioning
    Zhao, Wentian
    Wu, Xinxiao
    Luo, Jiebo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [4] Event-centric multi-modal fusion method for dense video captioning
    Chang, Zhi
    Zhao, Dexin
    Chen, Huilin
    Li, Jingdan
    Liu, Pengfei
    NEURAL NETWORKS, 2022, 146 : 120 - 129
  • [5] Multi-level video captioning method based on semantic space
    Yao, Xiao
    Zeng, Yuanlin
    Gu, Min
    Yuan, Ruxi
    Li, Jie
    Ge, Junyi
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (28) : 72113 - 72130
  • [6] Multi-level and Multi-modal Target Detection Based on Feature Fusion
    Cheng T.
    Sun L.
    Hou D.
    Shi Q.
    Zhang J.
    Chen J.
    Huang H.
    Qiche Gongcheng/Automotive Engineering, 2021, 43 (11): 1602 - 1610
  • [7] CMGNet: Collaborative multi-modal graph network for video captioning
    Rao, Qi
    Yu, Xin
    Li, Guang
    Zhu, Linchao
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 238
  • [8] Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning
    Xie, Yulai
    Niu, Jingjing
    Zhang, Yang
    Ren, Fang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 3164 - 3179
  • [9] Multi-level Interaction Network for Multi-Modal Rumor Detection
    Zou, Ting
    Qian, Zhong
    Li, Peifeng
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023
  • [10] Multi-Modal fusion with multi-level attention for Visual Dialog
    Zhang, Jingping
    Wang, Qiang
    Han, Yahong
    INFORMATION PROCESSING & MANAGEMENT, 2020, 57 (04)