Fusion of Multi-Modal Features to Enhance Dense Video Caption

Cited by: 3
Authors
Huang, Xuefei [1 ]
Chan, Ka-Hou [1 ,2 ]
Wu, Weifan [1 ]
Sheng, Hao [1 ,3 ,4 ]
Ke, Wei [1 ,2 ]
Affiliations
[1] Macao Polytech Univ, Fac Appl Sci, Macau 999078, Peoples R China
[2] Macao Polytech Univ, Engn Res Ctr Appl Technol Machine Translat & Artif, Minist Educ, Macau 999078, Peoples R China
[3] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
[4] Beihang Hangzhou Innovat Inst Yuhang, Hangzhou 310023, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
dense video caption; video captioning; multi-modal feature fusion; feature extraction; neural network; TRACKING;
DOI
10.3390/s23125565
CLC Number
O65 [Analytical Chemistry]
Discipline Codes
070302; 081704
Abstract
Dense video captioning aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most existing methods use only the visual features of the video and ignore the audio features, which are also essential for understanding it. In this paper, we propose a fusion model built on the Transformer framework that integrates both the visual and audio features of the video for captioning. We use multi-head attention to handle the differing sequence lengths of the models involved in our approach. We also introduce a Common Pool that stores the generated features and aligns them with the time steps, filtering the information and eliminating redundancy based on confidence scores. Moreover, we use an LSTM as the decoder to generate the description sentences, which reduces the memory footprint of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
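The abstract's use of multi-head attention to reconcile differing sequence lengths can be illustrated with a minimal NumPy sketch: visual features act as queries and audio features as keys/values, so the variable-length audio sequence is resampled onto the visual timeline before fusion. This is an illustrative assumption about the mechanism, not the paper's actual implementation; all function names, shapes, and the concatenation-based fusion are hypothetical, and no learned projections are included.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, audio, num_heads=4):
    """Align a variable-length audio sequence to the visual time steps.

    visual: (Tv, d) used as queries; audio: (Ta, d) used as keys/values.
    Returns a (Tv, d) array of audio features resampled onto the visual
    timeline, so the two modalities can be fused step by step.
    """
    Tv, d = visual.shape
    assert d % num_heads == 0
    dh = d // num_heads
    # Split the feature dimension into heads: (heads, T, dh)
    q = visual.reshape(Tv, num_heads, dh).transpose(1, 0, 2)
    k = audio.reshape(-1, num_heads, dh).transpose(1, 0, 2)
    v = k  # no learned value projection in this sketch
    # Scaled dot-product attention per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)  # (heads, Tv, Ta)
    out = softmax(scores) @ v                        # (heads, Tv, dh)
    return out.transpose(1, 0, 2).reshape(Tv, d)

rng = np.random.default_rng(0)
visual = rng.normal(size=(12, 64))   # 12 video time steps
audio = rng.normal(size=(30, 64))    # 30 audio frames (different length)
aligned_audio = cross_modal_attention(visual, audio)
# Fuse: concatenate visual features with time-aligned audio features
fused = np.concatenate([visual, aligned_audio], axis=1)  # (12, 128)
```

The key point the sketch demonstrates is that attention produces one aligned audio vector per visual time step regardless of the original audio length, so downstream fusion and decoding can proceed on a single shared timeline.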
Pages: 16
Related Papers
(50 in total)
  • [1] Parallel Dense Video Caption Generation with Multi-Modal Features
    Huang, Xuefei
    Chan, Ka-Hou
    Ke, Wei
    Sheng, Hao
    MATHEMATICS, 2023, 11 (17)
  • [2] Multi-modal Dense Video Captioning
    Iashin, Vladimir
    Rahtu, Esa
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 4117 - 4126
  • [3] Multi-modal fusion for video understanding
    Hoogs, A
    Mundy, J
    Cross, G
    30TH APPLIED IMAGERY PATTERN RECOGNITION WORKSHOP, PROCEEDINGS: ANALYSIS AND UNDERSTANDING OF TIME VARYING IMAGERY, 2001, : 103 - 108
  • [4] Event-centric multi-modal fusion method for dense video captioning
    Chang, Zhi
    Zhao, Dexin
    Chen, Huilin
    Li, Jingdan
    Liu, Pengfei
    NEURAL NETWORKS, 2022, 146 : 120 - 129
  • [5] VIDEO MEMORABILITY PREDICTION VIA LATE FUSION OF DEEP MULTI-MODAL FEATURES
    Leyva, Roberto
    Sanchez, Victor
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 2488 - 2492
  • [6] Layer-wise enhanced transformer with multi-modal fusion for image caption
    Li, Jingdan
    Wang, Yi
    Zhao, Dexin
    MULTIMEDIA SYSTEMS, 2023, 29 (03) : 1043 - 1056
  • [8] Class Consistent Multi-Modal Fusion with Binary Features
    Shrivastava, Ashish
    Rastegari, Mohammad
    Shekhar, Sumit
    Chellappa, Rama
    Davis, Larry S.
    2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2015, : 2282 - 2291
  • [9] Multi-modal Fusion
    Liu, Huaping
    Hussain, Amir
    Wang, Shuliang
    INFORMATION SCIENCES, 2018, 432 : 462 - 462
  • [10] Everything at Once - Multi-modal Fusion Transformer for Video Retrieval
    Shvetsova, Nina
    Chen, Brian
    Rouditchenko, Andrew
    Thomas, Samuel
    Kingsbury, Brian
    Feris, Rogerio
    Harwath, David
    Glass, James
    Kuehne, Hilde
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19988 - 19997