Attention-based Visual-Audio Fusion for Video Caption Generation

Cited by: 0
Authors
Guo, Ningning [1 ]
Liu, Huaping [2 ]
Jiang, Linhua [1 ]
Affiliations
[1] Univ Shanghai Sci & Technol, Dept Comp Technol, Shanghai, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
DOI
10.1109/icarm.2019.8834066
Chinese Library Classification (CLC)
T [Industrial Technology];
Discipline Classification Code
08;
Abstract
Recently, most work on generating a text description from a video has been based on an encoder-decoder framework. In the encoder stage, different convolutional neural networks are used to extract features from the audio and visual modalities respectively; the extracted features are then fed into the decoder stage, where an LSTM generates the video caption. Current research follows two main lines. One asks whether video captions can be generated more accurately by adopting different multimodal fusion strategies; the other asks whether they can be generated more accurately by adding an attention mechanism. In this paper, we propose a fusion framework that combines these two approaches into a new model. In the encoder stage, two multimodal fusion strategies, weight sharing and memory sharing, are applied so that the two kinds of features interact in producing the final feature outputs. In the decoder stage, an LSTM with an attention mechanism generates the video description. Our fusion model combining the two methods is validated on the Microsoft Research Video to Text (MSR-VTT) dataset.
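
To make the described architecture concrete, the following is a minimal PyTorch sketch (not the authors' released code) of an attention-based visual-audio fusion encoder paired with an attentive LSTM decoder. The feature dimensions, vocabulary size, and the specific shared-weight projection are illustrative assumptions rather than values from the paper, and the paper's memory-sharing fusion is omitted for brevity.

# Minimal sketch (not the authors' code) of attention-based
# visual-audio fusion for captioning, assuming PyTorch.
# Dimensions and vocabulary size are illustrative assumptions.
import torch
import torch.nn as nn


class FusionEncoder(nn.Module):
    """Projects visual and audio features through a shared linear layer
    (one plausible reading of the paper's 'sharing weights' fusion)."""

    def __init__(self, vis_dim=2048, aud_dim=128, hid_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.aud_proj = nn.Linear(aud_dim, hid_dim)
        # same weights applied to both modalities
        self.shared = nn.Linear(hid_dim, hid_dim)

    def forward(self, vis, aud):
        # vis: (B, T, vis_dim), aud: (B, T, aud_dim) -> (B, 2T, hid_dim)
        v = torch.tanh(self.shared(self.vis_proj(vis)))
        a = torch.tanh(self.shared(self.aud_proj(aud)))
        return torch.cat([v, a], dim=1)


class AttnDecoder(nn.Module):
    """LSTM decoder with additive attention over the fused features."""

    def __init__(self, vocab=10000, hid_dim=512, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.attn = nn.Linear(hid_dim * 2, 1)
        self.lstm = nn.LSTMCell(emb_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab)

    def forward(self, feats, tokens):
        # feats: (B, N, H); tokens: (B, L) gold caption (teacher forcing)
        B, N, H = feats.shape
        h = feats.new_zeros(B, H)
        c = feats.new_zeros(B, H)
        logits = []
        for t in range(tokens.size(1)):
            # score each fused feature against the current hidden state
            scores = self.attn(
                torch.cat([feats, h.unsqueeze(1).expand(B, N, H)], dim=2))
            ctx = (torch.softmax(scores, dim=1) * feats).sum(dim=1)  # (B, H)
            h, c = self.lstm(
                torch.cat([self.embed(tokens[:, t]), ctx], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (B, L, vocab)


if __name__ == "__main__":
    enc, dec = FusionEncoder(), AttnDecoder()
    vis = torch.randn(2, 20, 2048)    # e.g. per-frame CNN features
    aud = torch.randn(2, 20, 128)     # e.g. audio-network features
    caps = torch.randint(0, 10000, (2, 12))
    print(dec(enc(vis, aud), caps).shape)  # torch.Size([2, 12, 10000])

At inference time the loop would feed back the argmax of each step's logits instead of the gold tokens; the teacher-forced form above is the standard training setup for such encoder-decoder captioners.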
Pages: 839 - 844
Page count: 6