MRCap: Multi-modal and Multi-level Relationship-based Dense Video Captioning

Cited by: 0

Authors:

Chen, Wei [1]

Niu, Jianwei [1,2,3]

Liu, Xuefeng [1,2]

Affiliations:

[1] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing, Peoples R China

[2] Zhongguancun Lab, Beijing, Peoples R China

[3] Zhengzhou Univ, Res Inst Ind Technol, Sch Informat Engn, Zhengzhou, Peoples R China

Source:

2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Dense video captioning; event; multi-modal and multi-level; relationship;

DOI:

10.1109/ICME55011.2023.00445

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Dense video captioning, which aims to describe a sequence of events in a video, has received much attention recently. Because events in a video are highly correlated, leveraging relationships among events helps generate coherent captions. To utilize these relationships, existing methods mainly enrich event representations with their context, either in the form of vision alone (i.e., video segments) or by combining vision and language (i.e., captions). However, these methods do not explicitly exploit the correspondence between the two modalities. Moreover, the video-level context spanning multiple events is not fully exploited. In this paper, we propose MRCap, a novel relationship-based model for dense video captioning. The key component of MRCap is a multi-modal and multi-level event relationship module (MMERM). MMERM exploits the correspondence between vision and language at both the event level and the video level via contrastive learning. Experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that MRCap achieves state-of-the-art performance.
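
The abstract does not give the loss used by MMERM, so the following is only a minimal sketch of what vision-language contrastive learning at two levels could look like. It assumes a symmetric InfoNCE-style loss, mean pooling to obtain video-level embeddings, a temperature of 0.07, and toy embedding values; all of these, along with every function name, are illustrative assumptions rather than details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] is expected to match keys[i]."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(q)


def symmetric_contrastive_loss(visual, textual, temperature=0.07):
    """Average of the vision-to-language and language-to-vision losses."""
    return 0.5 * (info_nce(visual, textual, temperature)
                  + info_nce(textual, visual, temperature))


def mean_pool(vectors):
    """Assumed video-level summary: the element-wise mean of event embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


# Hypothetical toy embeddings: two videos, two events each, 3-dimensional features.
video_a_vis = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1]]
video_a_txt = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0]]
video_b_vis = [[0.0, 0.1, 1.0], [0.5, 0.0, 1.0]]
video_b_txt = [[0.1, 0.0, 0.9], [0.4, 0.1, 0.9]]

# Event level: each event segment is paired with its own caption within a video.
event_loss = symmetric_contrastive_loss(video_a_vis, video_a_txt)

# Video level: pooled visual context is paired with pooled caption context.
video_loss = symmetric_contrastive_loss(
    [mean_pool(video_a_vis), mean_pool(video_b_vis)],
    [mean_pool(video_a_txt), mean_pool(video_b_txt)],
)

total_loss = event_loss + video_loss
print(event_loss, video_loss, total_loss)
```

In this sketch the two levels are simply summed with equal weight; a real model would likely learn the embeddings and might weight the terms differently.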


Pages: 2615-2620

Page count: 6

