MRCap: Multi-modal and Multi-level Relationship-based Dense Video Captioning

Cited by: 0
Authors
Chen, Wei [1]
Niu, Jianwei [1,2,3]
Liu, Xuefeng [1,2]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing, Peoples R China
[2] Zhongguancun Lab, Beijing, Peoples R China
[3] Zhengzhou Univ, Res Inst Ind Technol, Sch Informat Engn, Zhengzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Dense video captioning; event; multi-modal and multi-level; relationship;
DOI
10.1109/ICME55011.2023.00445
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dense video captioning, with the objective of describing a sequence of events in a video, has received much attention recently. As events in a video are highly correlated, leveraging relationships among events helps generate coherent captions. To utilize relationships among events, existing methods mainly enrich event representations with their context, either in the form of vision (i.e., video segments) or by combining vision and language (i.e., captions). However, these methods do not explicitly exploit the correspondence between these two modalities. Moreover, the video-level context spanning multiple events is not fully exploited. In this paper, we propose MRCap, a novel relationship-based model for dense video captioning. The key component of MRCap is a multi-modal and multi-level event relationship module (MMERM). MMERM exploits the correspondence between vision and language at both the event level and the video level via contrastive learning. Experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that MRCap achieves state-of-the-art performance.
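The abstract describes aligning vision and language representations with contrastive learning. As an illustrative sketch only (not the authors' MMERM implementation), a symmetric InfoNCE-style loss over paired event embeddings could look like the following; the function name, temperature value, and embedding shapes are assumptions for the example:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(vision_emb, language_emb, temperature=0.07):
    """Symmetric InfoNCE loss between paired vision and language embeddings.

    vision_emb, language_emb: (N, D) tensors; row i of each is assumed to
    describe the same event (or video), so the diagonal of the similarity
    matrix holds the positive pairs and all other entries are negatives.
    """
    v = F.normalize(vision_emb, dim=-1)
    t = F.normalize(language_emb, dim=-1)
    logits = v @ t.T / temperature           # (N, N) scaled cosine similarities
    targets = torch.arange(v.size(0))        # positives lie on the diagonal
    loss_v2t = F.cross_entropy(logits, targets)   # match vision to language
    loss_t2v = F.cross_entropy(logits.T, targets) # match language to vision
    return (loss_v2t + loss_t2v) / 2

# Toy usage: 4 events with 8-dimensional embeddings
v = torch.randn(4, 8)
t = torch.randn(4, 8)
print(float(contrastive_loss(v, t)))
```

Applying such a loss at both the event level and the video level is one way to read the paper's "multi-level" alignment; the details would come from the paper itself.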
Pages: 2615-2620
Page count: 6
Related papers
50 items total
  • [41] MLMFNet: A multi-level modality fusion network for multi-modal accelerated MRI reconstruction
    Zhou, Xiuyun
    Zhang, Zhenxi
    Du, Hongwei
    Qiu, Bensheng
    MAGNETIC RESONANCE IMAGING, 2024, 111 : 246 - 255
  • [42] A robust multi-level sparse classifier with multi-modal feature extraction for face recognition
    Vishwakarma, Virendra P.
    Mishra, Gargi
    INTERNATIONAL JOURNAL OF APPLIED PATTERN RECOGNITION, 2019, 6 (01) : 76 - 102
  • [43] Multi-Level Cross-Modal Interactive-Network-Based Semi-Supervised Multi-Modal Ship Classification
    The School of Software Technology, Dalian University of Technology, Dalian 116621, China
    SENSORS, 2024, 22
  • [44] Multi-Modal Multi-Action Video Recognition
    Shi, Zhensheng
    Liang, Ju
    Li, Qianqian
    Zheng, Haiyong
    Gu, Zhaorui
    Dong, Junyu
    Zheng, Bing
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 13658 - 13667
  • [45] Multi-Level Visual Representation with Semantic-Reinforced Learning for Video Captioning
    Dong, Chengbo
    Chen, Xinru
    Chen, Aozhu
    Hu, Fan
    Wang, Zihan
    Li, Xirong
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 4750 - 4754
  • [46] Multi-modal fusion for video understanding
    Hoogs, A
    Mundy, J
    Cross, G
    30TH APPLIED IMAGERY PATTERN RECOGNITION WORKSHOP, PROCEEDINGS: ANALYSIS AND UNDERSTANDING OF TIME VARYING IMAGERY, 2001, : 103 - 108
  • [47] Multi-modal multi-view video coding based on correlation analysis
    Jiang, Gang-Yi
    Zhang, Yun
    Yu, Mei
    Jisuanji Xuebao/Chinese Journal of Computers, 2007, 30 (12): : 2205 - 2211
  • [48] Multi-level fusion network for mild cognitive impairment identification using multi-modal neuroimages
    Xu, Haozhe
    Zhong, Shengzhou
    Zhang, Yu
    PHYSICS IN MEDICINE AND BIOLOGY, 2023, 68 (09)
  • [49] Multi-level Deep Correlative Networks for Multi-modal Sentiment Analysis
    Cai, Guoyong
    Lyu, Guangrui
    Lin, Yuming
    Wen, Yimin
    CHINESE JOURNAL OF ELECTRONICS, 2020, 29 (06) : 1025 - 1038
  • [50] Contextualized Keyword Representations for Multi-modal Retinal Image Captioning
    Huang, Jia-Hong
    Wu, Ting-Wei
    Worring, Marcel
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 645 - 652