MRCap: Multi-modal and Multi-level Relationship-based Dense Video Captioning

Cited by: 0
Authors
Chen, Wei [1 ]
Niu, Jianwei [1 ,2 ,3 ]
Liu, Xuefeng [1 ,2 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing, Peoples R China
[2] Zhongguancun Lab, Beijing, Peoples R China
[3] Zhengzhou Univ, Res Inst Ind Technol, Sch Informat Engn, Zhengzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Dense video captioning; event; multi-modal and multi-level; relationship;
DOI
10.1109/ICME55011.2023.00445
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dense video captioning, with the objective of describing a sequence of events in a video, has received much attention recently. As events in a video are highly correlated, leveraging relationships among events helps generate coherent captions. To utilize relationships among events, existing methods mainly enrich event representations with their context, either in the form of vision (i.e., video segments) or a combination of vision and language (i.e., captions). However, these methods do not explicitly exploit the correspondence between the two modalities. Moreover, the video-level context spanning multiple events is not fully exploited. In this paper, we propose MRCap, a novel relationship-based model for dense video captioning. The key component of MRCap is a multi-modal and multi-level event relationship module (MMERM). MMERM exploits the correspondence between vision and language at both the event level and the video level via contrastive learning. Experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that MRCap achieves state-of-the-art performance.
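The abstract describes MMERM as aligning vision and language via contrastive learning at both the event level and the video level. The paper's exact loss and encoders are not given in this record; the following is a minimal, hypothetical PyTorch sketch of symmetric InfoNCE-style alignment applied at two levels, where all tensor names, dimensions, and the temperature value are illustrative assumptions rather than details from the paper.

```python
# Hypothetical sketch only: MRCap's actual loss, encoders, and hyper-parameters
# are not given in this record. This illustrates symmetric InfoNCE-style
# vision-language alignment at two levels, as the abstract describes for MMERM.
import torch
import torch.nn.functional as F

def info_nce(vis: torch.Tensor, txt: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over paired embeddings vis[i] <-> txt[i]."""
    vis = F.normalize(vis, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = vis @ txt.t() / temperature          # (N, N) cosine similarities
    targets = torch.arange(vis.size(0), device=vis.device)
    # Matched vision-language pairs lie on the diagonal; all other batch
    # entries act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Event level: one embedding per event segment and per event caption.
event_vis = torch.randn(16, 512)   # 16 events, visual features (assumed dims)
event_txt = torch.randn(16, 512)   # 16 events, caption features
# Video level: one pooled embedding per whole video and its full description.
video_vis = torch.randn(4, 512)    # 4 videos in the batch
video_txt = torch.randn(4, 512)

loss = info_nce(event_vis, event_txt) + info_nce(video_vis, video_txt)
```

Summing the two terms is one plausible way to combine the levels; the paper may weight or structure them differently.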
Pages: 2615 - 2620
Page count: 6
Related Papers
50 records in total
  • [21] Fusion of Multi-Modal Features to Enhance Dense Video Caption
    Huang, Xuefei
    Chan, Ka-Hou
    Wu, Weifan
    Sheng, Hao
    Ke, Wei
    SENSORS, 2023, 23 (12)
  • [22] An efficient deep learning-based video captioning framework using multi-modal features
    Varma, Soumya
    James, Dinesh Peter
    EXPERT SYSTEMS, 2021
  • [23] Multi-Modal Image Captioning for the Visually Impaired
    Ahsan, Hiba
    Bhalla, Nikita
    Bhatt, Daivat
    Shah, Kaivankumar
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 53 - 60
  • [24] Multi-modal brain image fusion based on multi-level edge-preserving filtering
    Tan, Wei
    Thiton, William
    Xiang, Pei
    Zhou, Huixin
    BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2021, 64
  • [25] Event-centric Multi-modal Fusion Method for Dense Video Captioning (vol 146, pg 120, 2022)
    Chang, Zhi
    Zhao, Dexin
    Chen, Huilin
    Li, Jingdan
    Liu, Pengfei
    NEURAL NETWORKS, 2022, 152 : 527 - 527
  • [26] Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning
    Kim, Dong-Jin
    Choi, Jinsoo
    Oh, Tae-Hyun
    Kweon, In So
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6264 - 6273
  • [27] Multi-level Fusion of Multi-modal Semantic Embeddings for Zero Shot Learning
    Kong, Zhe
    Wang, Xin
    Gao, Neng
    Zhang, Yifei
    Liu, Yuhan
    Tu, Chenyang
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2022, 2022, : 310 - 318
  • [28] Multi-level Video Captioning based on Label Classification using Machine Learning Techniques
    Vaishnavi, J.
    Narmatha, V.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (11) : 582 - 588
  • [29] MAM-RNN: Multi-level Attention Model Based RNN for Video Captioning
    Li, Xuelong
    Zhao, Bin
    Lu, Xiaoqiang
    PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 2208 - 2214
  • [30] Multi-modal Video Summarization
    Huang, Jia-Hong
    ICMR 2024 - Proceedings of the 2024 International Conference on Multimedia Retrieval, 2024, : 1214 - 1218