Fusion of Multi-Modal Features to Enhance Dense Video Caption

Cited by: 3
Authors
Huang, Xuefei [1 ]
Chan, Ka-Hou [1 ,2 ]
Wu, Weifan [1 ]
Sheng, Hao [1 ,3 ,4 ]
Ke, Wei [1 ,2 ]
Affiliations
[1] Macao Polytech Univ, Fac Appl Sci, Macau 999078, Peoples R China
[2] Macao Polytech Univ, Engn Res Ctr Appl Technol Machine Translat & Artif, Minist Educ, Macau 999078, Peoples R China
[3] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
[4] Beihang Hangzhou Innovat Inst Yuhang, Hangzhou 310023, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
dense video caption; video captioning; multi-modal feature fusion; feature extraction; neural network; TRACKING;
DOI
10.3390/s23125565
Chinese Library Classification
O65 [Analytical Chemistry];
Discipline Codes
070302; 081704;
Abstract
Dense video captioning is a task that aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most existing methods use only the visual features of the video and ignore the audio features, which are also essential for understanding it. In this paper, we propose a fusion model built on the Transformer framework that integrates both visual and audio features of the video for captioning. We use multi-head attention to handle the variation in sequence lengths between the modalities involved in our approach. We also introduce a Common Pool to store the generated features and align them with the time steps, filtering the information and eliminating redundancy based on confidence scores. Moreover, we use an LSTM as the decoder to generate the description sentences, which reduces the memory footprint of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
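The abstract describes using multi-head attention to reconcile visual and audio feature sequences of different lengths before captioning. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: visual features act as queries and attend over audio features, producing an audio summary aligned to the visual timeline. All names, dimensions, and the single-layer structure are illustrative assumptions.

```python
import numpy as np

def multi_head_cross_attention(query, context, num_heads=4):
    """Align two feature sequences of different lengths via attention.

    query:   (T_q, d) array, e.g. per-frame visual features
    context: (T_c, d) array, e.g. per-segment audio features
    Returns: (T_q, d) context features re-sampled onto the query timeline.
    """
    T_q, d = query.shape
    T_c, _ = context.shape
    assert d % num_heads == 0, "feature dim must divide evenly into heads"
    d_h = d // num_heads

    # Split the feature dimension into heads: (num_heads, T, d_h)
    q = query.reshape(T_q, num_heads, d_h).transpose(1, 0, 2)
    k = context.reshape(T_c, num_heads, d_h).transpose(1, 0, 2)
    v = k  # keys and values both come from the context modality here

    # Scaled dot-product attention per head: scores are (num_heads, T_q, T_c)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)
    scores -= scores.max(axis=-1, keepdims=True)  # softmax stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Weighted sum over context positions, then merge heads back to (T_q, d)
    fused = (weights @ v).transpose(1, 0, 2).reshape(T_q, d)
    return fused

rng = np.random.default_rng(0)
visual = rng.standard_normal((30, 64))  # 30 frames of visual features
audio = rng.standard_normal((50, 64))   # 50 segments of audio features
fused = multi_head_cross_attention(visual, audio)
print(fused.shape)  # (30, 64): audio information aligned to the 30 frames
```

Because the output has the query's length regardless of the context's length, the fused audio features can be concatenated or added to the visual features frame by frame before decoding, which is the kind of length reconciliation the abstract attributes to multi-head attention.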
Pages: 16
Related Papers
50 in total
  • [21] MULTI-MODAL INFORMATION FUSION FOR NEWS STORY SEGMENTATION IN BROADCAST VIDEO
    Feng, Bailan
    Ding, Peng
    Chen, Jiansong
    Bai, Jinfeng
    Xu, Su
    Xu, Bo
    2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2012, : 1417 - 1420
  • [22] Video Visual Relation Detection via Multi-modal Feature Fusion
    Sun, Xu
    Ren, Tongwei
    Zi, Yuan
    Wu, Gangshan
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 2657 - 2661
  • [23] Language-guided Multi-Modal Fusion for Video Action Recognition
    Hsiao, Jenhao
    Li, Yikang
    Ho, Chiuman
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3151 - 3155
  • [24] MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering
    Ahmad, Mobeen
    Park, Geonwoo
    Park, Dongchan
    Park, Sanguk
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 4659 - 4664
  • [25] Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
    Acar, Esra
    Hopfgartner, Frank
    Albayrak, Sahin
    2015 13TH INTERNATIONAL WORKSHOP ON CONTENT-BASED MULTIMEDIA INDEXING (CBMI), 2015,
  • [26] Video Relation Detection with Trajectory-aware Multi-modal Features
    Xie, Wentao
    Ren, Guanghui
    Liu, Si
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4590 - 4594
  • [27] Event-centric Multi-modal Fusion Method for Dense Video Captioning (vol 146, pg 120, 2022)
    Chang, Zhi
    Zhao, Dexin
    Chen, Huilin
    Li, Jingdan
    Liu, Pengfei
    NEURAL NETWORKS, 2022, 152 : 527 - 527
  • [28] Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning
    Xie, Yulai
    Niu, Jingjing
    Zhang, Yang
    Ren, Fang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 3164 - 3179
  • [29] MRCap: Multi-modal and Multi-level Relationship-based Dense Video Captioning
    Chen, Wei
    Niu, Jianwei
    Liu, Xuefeng
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 2615 - 2620
  • [30] Deep fusion of multi-modal features for brain tumor image segmentation
    Zhang, Guying
    Zhou, Jia
    He, Guanghua
    Zhu, Hancan
    HELIYON, 2023, 9 (08)