Event-centric multi-modal fusion method for dense video captioning

Cited by: 13
Authors
Chang, Zhi [1 ]
Zhao, Dexin [1 ,2 ]
Chen, Huilin [1 ]
Li, Jingdan [1 ]
Liu, Pengfei [1 ]
Affiliations
[1] Tianjin Univ Technol, Tianjin Key Lab Intelligence Comp & Novel Softwar, Tianjin 300384, Peoples R China
[2] Minist Educ, Key Lab Comp Vis & Syst, Beijing, Peoples R China
Keywords
Dense video captioning; Event-centric; Multi-modal fusion;
DOI
10.1016/j.neunet.2021.11.017
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dense video captioning aims to automatically describe the several events that occur in a given video, which most state-of-the-art models accomplish by locating and describing multiple events in an untrimmed video. Despite much progress in this area, most current approaches encode only visual features in the event localization phase and neglect the relationships between events, which may degrade the consistency of the descriptions within the same video. Thus, in the present study, we exploit visual-audio cues to generate event proposals and enhance event-level representations by capturing their temporal and semantic relationships. Furthermore, to compensate for the major limitation of not fully utilizing multi-modal information in the description process, we developed an attention-gating mechanism that dynamically fuses and regulates the multi-modal information. In summary, we propose an event-centric multi-modal fusion approach for dense video captioning (EMVC) to capture the relationships between events and effectively fuse multi-modal information. We conducted comprehensive experiments to evaluate the performance of EMVC on the benchmark ActivityNet Captions and YouCook2 datasets, and the results show that our model achieves impressive performance compared with state-of-the-art methods. (C) 2021 Elsevier Ltd. All rights reserved.
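To make the idea of an attention-gating multi-modal fusion concrete, below is a minimal PyTorch sketch of one possible gated visual-audio fusion module. It is not the authors' EMVC architecture: the class name, feature dimensions, and the sigmoid-gate formulation are assumptions made purely for illustration of the general technique described in the abstract.

```python
import torch
import torch.nn as nn


class GatedMultiModalFusion(nn.Module):
    """Toy gated fusion of per-event visual and audio features.

    Illustrative sketch only, NOT the EMVC implementation: the class name,
    feature dimensions, and gating form are assumptions.
    """

    def __init__(self, visual_dim=1024, audio_dim=128, hidden_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # The gate is computed from both modalities and decides, per channel,
        # how much of each projected modality enters the fused representation.
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Sigmoid(),
        )

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, T, visual_dim); audio_feats: (batch, T, audio_dim)
        v = torch.tanh(self.visual_proj(visual_feats))
        a = torch.tanh(self.audio_proj(audio_feats))
        g = self.gate(torch.cat([v, a], dim=-1))  # gate values in (0, 1)
        return g * v + (1.0 - g) * a              # gated convex combination


if __name__ == "__main__":
    fusion = GatedMultiModalFusion()
    v = torch.randn(2, 20, 1024)  # e.g. clip-level visual features per proposal
    a = torch.randn(2, 20, 128)   # e.g. VGGish-style audio features per proposal
    print(fusion(v, a).shape)     # torch.Size([2, 20, 512])
```

The gate lets the model lean on whichever modality is more informative for a given event (for instance, audio during speech-heavy segments), which is the intuition behind dynamically regulating multi-modal information before caption generation.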
Pages: 120-129
Number of pages: 10
Related papers
50 records in total
  • [1] Event-centric Multi-modal Fusion Method for Dense Video Captioning (vol 146, pg 120, 2022)
    Chang, Zhi
    Zhao, Dexin
    Chen, Huilin
    Li, Jingdan
    Liu, Pengfei
    [J]. NEURAL NETWORKS, 2022, 152 : 527 - 527
  • [2] Event-Centric Hierarchical Representation for Dense Video Captioning
    Wang, Teng
    Zheng, Huicheng
    Yu, Mingjing
    Tian, Qian
    Hu, Haifeng
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (05) : 1890 - 1900
  • [3] Multi-modal Dense Video Captioning
    Iashin, Vladimir
    Rahtu, Esa
    [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 4117 - 4126
  • [4] MULTI-MODAL HIERARCHICAL ATTENTION-BASED DENSE VIDEO CAPTIONING
    Munusamy, Hemalatha
    Sekhar, Chandra C.
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 475 - 479
  • [5] Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning
    Rahman, Tanzila
    Xu, Bicheng
    Sigal, Leonid
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 8907 - 8916
  • [6] Multi-modal Dependency Tree for Video Captioning
    Zhao, Wentian
    Wu, Xinxiao
    Luo, Jiebo
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [7] Fusion of Multi-Modal Features to Enhance Dense Video Caption
    Huang, Xuefei
    Chan, Ka-Hou
    Wu, Weifan
    Sheng, Hao
    Ke, Wei
    [J]. SENSORS, 2023, 23 (12)
  • [8] MRCap: Multi-modal and Multi-level Relationship-based Dense Video Captioning
    Chen, Wei
    Niu, Jianwei
    Liu, Xuefeng
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 2615 - 2620
  • [9] Video summarization for event-centric videos
    Li, Qingwen
    Chen, Jianni
    Xie, Qiqin
    Han, Xiao
    [J]. NEURAL NETWORKS, 2023, 161 : 359 - 370
  • [10] Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning
    Xie, Yulai
    Niu, Jingjing
    Zhang, Yang
    Ren, Fang
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 3164 - 3179