Fusion of Multi-Modal Features to Enhance Dense Video Caption

Cited: 3
Authors
Huang, Xuefei [1 ]
Chan, Ka-Hou [1 ,2 ]
Wu, Weifan [1 ]
Sheng, Hao [1 ,3 ,4 ]
Ke, Wei [1 ,2 ]
Affiliations
[1] Macao Polytech Univ, Fac Appl Sci, Macau 999078, Peoples R China
[2] Macao Polytech Univ, Engn Res Ctr Appl Technol Machine Translat & Artif Intelligence, Minist Educ, Macau 999078, Peoples R China
[3] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
[4] Beihang Hangzhou Innovat Inst Yuhang, Hangzhou 310023, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
dense video caption; video captioning; multi-modal feature fusion; feature extraction; neural network; TRACKING;
DOI
10.3390/s23125565
CLC Classification
O65 [Analytical Chemistry];
Discipline Codes
070302; 081704;
Abstract
Dense video captioning aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most existing methods use only the visual features of the video and ignore the audio features, which are also essential for understanding it. In this paper, we propose a fusion model built on the Transformer framework that integrates both visual and audio features for captioning. We use multi-head attention to handle the differing sequence lengths of the modalities involved in our approach. We also introduce a Common Pool that stores the generated features and aligns them with the time steps, filtering the information and eliminating redundancy based on confidence scores. Moreover, we use an LSTM as the decoder to generate the description sentences, which reduces the memory size of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
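
The abstract describes the architecture only in prose; the following minimal PyTorch sketch illustrates the general idea of cross-modal multi-head attention over streams of different lengths, a confidence-based top-k filter standing in for the Common Pool, and an LSTM decoder. All module names, feature dimensions (e.g., 2048-d visual, 128-d audio), and the filtering rule are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of the fusion idea in the abstract; all names,
# dimensions, and the confidence-based filter are assumptions.
import torch
import torch.nn as nn

class MultiModalFusionCaptioner(nn.Module):
    def __init__(self, d_model=512, n_heads=8, vocab_size=10000):
        super().__init__()
        # Project per-modality features into a shared space.
        self.vis_proj = nn.Linear(2048, d_model)  # e.g., C3D/I3D visual features
        self.aud_proj = nn.Linear(128, d_model)   # e.g., VGGish audio features
        # Cross-modal multi-head attention: visual queries attend over audio,
        # which tolerates different sequence lengths between the two streams.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Confidence score per fused time step, used to filter redundancy.
        self.confidence = nn.Linear(d_model, 1)
        # LSTM decoder generating the caption tokens.
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, vis_feats, aud_feats, keep_ratio=0.5):
        # vis_feats: (B, T_v, 2048); aud_feats: (B, T_a, 128); T_v != T_a is fine.
        q = self.vis_proj(vis_feats)
        kv = self.aud_proj(aud_feats)
        fused, _ = self.cross_attn(q, kv, kv)        # (B, T_v, d_model)
        # "Common Pool" stand-in: keep the most confident time steps.
        scores = self.confidence(fused).squeeze(-1)  # (B, T_v)
        k = max(1, int(keep_ratio * fused.size(1)))
        idx = scores.topk(k, dim=1).indices.sort(dim=1).values
        pooled = fused.gather(1, idx.unsqueeze(-1).expand(-1, -1, fused.size(-1)))
        # Decode the filtered sequence into per-step word logits.
        hidden, _ = self.decoder(pooled)
        return self.out(hidden)                      # (B, k, vocab_size)

For example, a forward pass with vis_feats of shape (2, 120, 2048) and aud_feats of shape (2, 80, 128) returns logits for the retained time steps; the cross-attention is what lets the two sequence lengths differ.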
Pages: 16
Related Papers
50 records in total
  • [31] Learning Visual Emotion Distributions via Multi-Modal Features Fusion
    Zhao, Sicheng
    Ding, Guiguang
    Gao, Yue
    Han, Jungong
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 369 - 377
  • [32] Gated Multi-modal Fusion with Cross-modal Contrastive Learning for Video Question Answering
    Lyu, Chenyang
    Li, Wenxi
    Ji, Tianbo
    Zhou, Liting
    Gurrin, Cathal
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260 : 427 - 438
  • [33] A Novel Deep Multi-Modal Feature Fusion Method for Celebrity Video Identification
    Chen, Jianrong
    Yang, Li
    Xu, Yuanyuan
    Huo, Jing
    Shi, Yinghuan
    Gao, Yang
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 2535 - 2538
  • [34] Visual-guided hierarchical iterative fusion for multi-modal video action recognition
    Zhang, Bingbing
    Zhang, Ying
    Zhang, Jianxin
    Sun, Qiule
    Wang, Rong
    Zhang, Qiang
    PATTERN RECOGNITION LETTERS, 2024, 186 : 213 - 220
  • [35] Multi-modal video event recognition based on association rules and decision fusion
    Guder, Mennan
    Cicekli, Nihan Kesim
    MULTIMEDIA SYSTEMS, 2018, 24 (01) : 55 - 72
  • [37] Soft multi-modal data fusion
    Coppock, S
    Mazack, L
    PROCEEDINGS OF THE 12TH IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS, VOLS 1 AND 2, 2003, : 636 - 641
  • [38] Multi-modal data fusion: A description
    Coppock, S
    Mazlack, LJ
    KNOWLEDGE-BASED INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, PT 2, PROCEEDINGS, 2004, 3214 : 1136 - 1142
  • [39] Image caption of space science experiment based on multi-modal learning
    Li P.-Z.
    Wan X.
    Li S.-Y.
    Guangxue Jingmi Gongcheng/Optics and Precision Engineering, 2021, 29 (12) : 2944 - 2955
  • [40] MidFusNet: Mid-dense Fusion Network for Multi-modal Brain MRI Segmentation
    Duan, Wenting
    Zhang, Lei
    Colman, Jordan
    Gulli, Giosue
    Ye, Xujiong
    BRAINLESION: GLIOMA, MULTIPLE SCLEROSIS, STROKE AND TRAUMATIC BRAIN INJURIES, BRAINLES 2022, 2023, 13769 : 102 - 114