MRCap: Multi-modal and Multi-level Relationship-based Dense Video Captioning

Cited by: 0
Authors
Chen, Wei [1]
Niu, Jianwei [1,2,3]
Liu, Xuefeng [1,2]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing, Peoples R China
[2] Zhongguancun Lab, Beijing, Peoples R China
[3] Zhengzhou Univ, Res Inst Ind Technol, Sch Informat Engn, Zhengzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Dense video captioning; event; multi-modal and multi-level; relationship;
DOI
10.1109/ICME55011.2023.00445
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dense video captioning, which aims to describe a sequence of events in a video, has received much attention recently. Because events in a video are highly correlated, leveraging relationships among events helps generate coherent captions. To exploit these relationships, existing methods mainly enrich event representations with their context, either in the form of vision alone (i.e., video segments) or a combination of vision and language (i.e., captions). However, these methods do not explicitly exploit the correspondence between the two modalities. Moreover, the video-level context spanning multiple events is not fully exploited. In this paper, we propose MRCap, a novel relationship-based model for dense video captioning. At the core of MRCap is a multi-modal and multi-level event relationship module (MMERM), which exploits the correspondence between vision and language at both the event level and the video level via contrastive learning. Experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that MRCap achieves state-of-the-art performance.
Pages: 2615-2620
Number of pages: 6
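The abstract does not give implementation details for MMERM; the lines below are a minimal sketch, in PyTorch, of what a two-level (event-level and video-level) contrastive alignment between visual and textual event features could look like. All names, tensor shapes, the mean-pooling choice, and the temperature value are illustrative assumptions, not the authors' released code.

# Minimal sketch (assumed, not the authors' implementation) of a symmetric
# InfoNCE-style contrastive alignment between vision and language features
# at both the event level and the video level, as described in the abstract.
import torch
import torch.nn.functional as F


def info_nce(x: torch.Tensor, y: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over paired embeddings x[i] <-> y[i], each of shape (N, D)."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                    # (N, N) cosine-similarity matrix
    targets = torch.arange(x.size(0), device=x.device)  # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


def multi_level_contrastive_loss(event_vis: torch.Tensor, event_txt: torch.Tensor) -> torch.Tensor:
    """event_vis, event_txt: (batch, num_events, dim) visual / caption features.

    Event level: align each event's video segment with its own caption.
    Video level: align mean-pooled event features across whole videos.
    """
    b, e, d = event_vis.shape
    event_loss = info_nce(event_vis.reshape(b * e, d), event_txt.reshape(b * e, d))
    video_loss = info_nce(event_vis.mean(dim=1), event_txt.mean(dim=1))
    return event_loss + video_loss


# Usage example with random features: 4 videos, 3 events each, 256-d embeddings.
loss = multi_level_contrastive_loss(torch.randn(4, 3, 256), torch.randn(4, 3, 256))

A symmetric InfoNCE objective is one standard way to realize such vision-language correspondence; the paper's actual module may differ in architecture, negative sampling, and loss weighting.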