Event-centric multi-modal fusion method for dense video captioning

Cited by: 13
Authors
Chang, Zhi [1 ]
Zhao, Dexin [1 ,2 ]
Chen, Huilin [1 ]
Li, Jingdan [1 ]
Liu, Pengfei [1 ]
Affiliations
[1] Tianjin Univ Technol, Tianjin Key Lab Intelligence Comp & Novel Softwar, Tianjin 300384, Peoples R China
[2] Minist Educ, Key Lab Comp Vis & Syst, Beijing, Peoples R China
Keywords
Dense video captioning; Event-centric; Multi-modal fusion;
DOI
10.1016/j.neunet.2021.11.017
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dense video captioning aims to automatically describe the several events that occur in a given video, which most state-of-the-art models accomplish by locating and describing multiple events in an untrimmed video. Despite much progress in this area, most current approaches encode only visual features in the event localization phase and neglect the relationships between events, which may degrade the consistency of the descriptions within the same video. Thus, in the present study, we exploit visual-audio cues to generate event proposals and enhance event-level representations by capturing their temporal and semantic relationships. Furthermore, to compensate for the major limitation of not fully utilizing multi-modal information in the description process, we developed an attention-gating mechanism that dynamically fuses and regulates the multi-modal information. In summary, we propose an event-centric multi-modal fusion approach for dense video captioning (EMVC) to capture the relationships between events and effectively fuse multi-modal information. We conducted comprehensive experiments to evaluate the performance of EMVC on the benchmark ActivityNet Captions and YouCook2 datasets, and the results show that our model achieves impressive performance compared with state-of-the-art methods. (C) 2021 Elsevier Ltd. All rights reserved.
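To make the idea of an attention-gating multi-modal fusion concrete, below is a minimal PyTorch sketch of one possible gated visual-audio fusion module. It is not the authors' EMVC architecture: the class name, feature dimensions, and the sigmoid-gate formulation are assumptions made purely for illustration of the general technique described in the abstract.

```python
import torch
import torch.nn as nn


class GatedMultiModalFusion(nn.Module):
    """Toy gated fusion of per-event visual and audio features.

    Illustrative sketch only, NOT the EMVC implementation: the class name,
    feature dimensions, and gating form are assumptions.
    """

    def __init__(self, visual_dim=1024, audio_dim=128, hidden_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # The gate is computed from both modalities and decides, per channel,
        # how much of each projected modality enters the fused representation.
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Sigmoid(),
        )

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, T, visual_dim); audio_feats: (batch, T, audio_dim)
        v = torch.tanh(self.visual_proj(visual_feats))
        a = torch.tanh(self.audio_proj(audio_feats))
        g = self.gate(torch.cat([v, a], dim=-1))  # gate values in (0, 1)
        return g * v + (1.0 - g) * a              # gated convex combination


if __name__ == "__main__":
    fusion = GatedMultiModalFusion()
    v = torch.randn(2, 20, 1024)  # e.g. clip-level visual features per proposal
    a = torch.randn(2, 20, 128)   # e.g. VGGish-style audio features per proposal
    print(fusion(v, a).shape)     # torch.Size([2, 20, 512])
```

The gate lets the model lean on whichever modality is more informative for a given event (for instance, audio during speech-heavy segments), which is the intuition behind dynamically regulating multi-modal information before caption generation.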
Pages: 120-129
Number of pages: 10
Related papers
50 records in total
  • [1] Event-centric Multi-modal Fusion Method for Dense Video Captioning (vol 146, pg 120, 2022)
    Chang, Zhi
    Zhao, Dexin
    Chen, Huilin
    Li, Jingdan
    Liu, Pengfei
    [J]. NEURAL NETWORKS, 2022, 152 : 527 - 527
  • [2] Event-Centric Hierarchical Representation for Dense Video Captioning
    Wang, Teng
    Zheng, Huicheng
    Yu, Mingjing
    Tian, Qian
    Hu, Haifeng
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (05) : 1890 - 1900
  • [3] Multi-modal Dense Video Captioning
    Iashin, Vladimir
    Rahtu, Esa
    [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 4117 - 4126
  • [4] MULTI-MODAL HIERARCHICAL ATTENTION-BASED DENSE VIDEO CAPTIONING
    Munusamy, Hemalatha
    Sekhar, Chandra C.
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 475 - 479
  • [5] Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning
    Rahman, Tanzila
    Xu, Bicheng
    Sigal, Leonid
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 8907 - 8916
  • [6] Multi-modal Dependency Tree for Video Captioning
    Zhao, Wentian
    Wu, Xinxiao
    Luo, Jiebo
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [7] Fusion of Multi-Modal Features to Enhance Dense Video Caption
    Huang, Xuefei
    Chan, Ka-Hou
    Wu, Weifan
    Sheng, Hao
    Ke, Wei
    [J]. SENSORS, 2023, 23 (12)
  • [8] MRCap: Multi-modal and Multi-level Relationship-based Dense Video Captioning
    Chen, Wei
    Niu, Jianwei
    Liu, Xuefeng
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 2615 - 2620
  • [9] Video summarization for event-centric videos
    Li, Qingwen
    Chen, Jianni
    Xie, Qiqin
    Han, Xiao
    [J]. NEURAL NETWORKS, 2023, 161 : 359 - 370
  • [10] Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning
    Xie, Yulai
    Niu, Jingjing
    Zhang, Yang
    Ren, Fang
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 3164 - 3179