Fusion of Multi-Modal Features to Enhance Dense Video Caption

Cited by: 3
Authors
Huang, Xuefei [1]
Chan, Ka-Hou [1,2]
Wu, Weifan [1]
Sheng, Hao [1,3,4]
Ke, Wei [1,2]
Affiliations
[1] Macao Polytech Univ, Fac Appl Sci, Macau 999078, Peoples R China
[2] Macao Polytech Univ, Engn Res Ctr Appl Technol Machine Translat & Artif, Minist Educ, Macau 999078, Peoples R China
[3] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
[4] Beihang Hangzhou Innovat Inst Yuhang, Hangzhou 310023, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
dense video caption; video captioning; multi-modal feature fusion; feature extraction; neural network; TRACKING;
DOI
10.3390/s23125565
Chinese Library Classification number
O65 [Analytical Chemistry];
Discipline classification code
070302; 081704;
Abstract
Dense video captioning aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most existing methods use only the visual features in the video and ignore the audio features, which are also essential for understanding it. In this paper, we propose a fusion model that uses the Transformer framework to integrate both visual and audio features in the video for captioning. We use multi-head attention to handle the differences in sequence length between the models involved in our approach. We also introduce a Common Pool that stores the generated features and aligns them with the time steps, thus filtering the information and eliminating redundancy based on confidence scores. Moreover, we use an LSTM as the decoder to generate the description sentences, which reduces the memory size of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
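The abstract notes that attention is used to cope with sequences of different lengths. The sketch below illustrates only that general property, not the authors' model: in scaled dot-product cross-attention the output has one row per query, so a visual sequence can attend over an audio sequence of a different length. It uses a single head for brevity (multi-head attention repeats the same computation over several projected feature slices), and the function names, feature values, and dimensions are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: T_q vectors of size d (here: toy visual features)
    keys, values: T_k vectors (here: toy audio features)
    T_q and T_k may differ; the output always has T_q rows.
    """
    d = len(queries[0])
    dim_v = len(values[0])
    output = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)])
    return output


# Hypothetical toy inputs: 3 visual time steps, 5 audio time steps, d = 2.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.7], [0.0, 0.3], [0.9, 0.0]]

fused = cross_attention(visual, audio, audio)
print(len(fused), len(fused[0]))  # prints: 3 2
```

The fused sequence keeps the visual length (3) even though the audio sequence has 5 steps, which is why no explicit resampling of either modality is needed before fusion in a scheme of this kind.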
Pages: 16