Attention-based Visual-Audio Fusion for Video Caption Generation

Cited by: 0
Authors
Guo, Ningning [1 ]
Liu, Huaping [2 ]
Jiang, Linhua [1 ]
Affiliations
[1] Univ Shanghai Sci & Technol, Dept Comp Technol, Shanghai, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
DOI
10.1109/icarm.2019.8834066
Chinese Library Classification (CLC)
T [Industrial Technology];
Discipline Classification Code
08;
Abstract
Recently, most work on generating a text description from a video has been based on an encoder-decoder framework. In the encoder stage, different convolutional neural networks are used to extract features from the audio and visual modalities respectively; the extracted features are then fed into the decoder stage, where an LSTM generates the video caption. Current research follows two main lines. One asks whether video captions can be generated more accurately by adopting different multimodal fusion strategies; the other asks whether they can be generated more accurately by adding an attention mechanism. In this paper, we propose a fusion framework that combines these two approaches into a new model. In the encoder stage, two multimodal fusion strategies, weight sharing and memory sharing, are applied so that the two kinds of features interact in producing the final feature outputs. In the decoder stage, an LSTM with an attention mechanism generates the video description. Our fusion model combining the two methods is validated on the Microsoft Research Video to Text (MSR-VTT) dataset.
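
To make the described architecture concrete, the following is a minimal PyTorch sketch (not the authors' released code) of an attention-based visual-audio fusion encoder paired with an attentive LSTM decoder. The feature dimensions, vocabulary size, and the specific shared-weight projection are illustrative assumptions rather than values from the paper, and the paper's memory-sharing fusion is omitted for brevity.

# Minimal sketch (not the authors' code) of attention-based
# visual-audio fusion for captioning, assuming PyTorch.
# Dimensions and vocabulary size are illustrative assumptions.
import torch
import torch.nn as nn


class FusionEncoder(nn.Module):
    """Projects visual and audio features through a shared linear layer
    (one plausible reading of the paper's 'sharing weights' fusion)."""

    def __init__(self, vis_dim=2048, aud_dim=128, hid_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.aud_proj = nn.Linear(aud_dim, hid_dim)
        # same weights applied to both modalities
        self.shared = nn.Linear(hid_dim, hid_dim)

    def forward(self, vis, aud):
        # vis: (B, T, vis_dim), aud: (B, T, aud_dim) -> (B, 2T, hid_dim)
        v = torch.tanh(self.shared(self.vis_proj(vis)))
        a = torch.tanh(self.shared(self.aud_proj(aud)))
        return torch.cat([v, a], dim=1)


class AttnDecoder(nn.Module):
    """LSTM decoder with additive attention over the fused features."""

    def __init__(self, vocab=10000, hid_dim=512, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.attn = nn.Linear(hid_dim * 2, 1)
        self.lstm = nn.LSTMCell(emb_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab)

    def forward(self, feats, tokens):
        # feats: (B, N, H); tokens: (B, L) gold caption (teacher forcing)
        B, N, H = feats.shape
        h = feats.new_zeros(B, H)
        c = feats.new_zeros(B, H)
        logits = []
        for t in range(tokens.size(1)):
            # score each fused feature against the current hidden state
            scores = self.attn(
                torch.cat([feats, h.unsqueeze(1).expand(B, N, H)], dim=2))
            ctx = (torch.softmax(scores, dim=1) * feats).sum(dim=1)  # (B, H)
            h, c = self.lstm(
                torch.cat([self.embed(tokens[:, t]), ctx], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (B, L, vocab)


if __name__ == "__main__":
    enc, dec = FusionEncoder(), AttnDecoder()
    vis = torch.randn(2, 20, 2048)    # e.g. per-frame CNN features
    aud = torch.randn(2, 20, 128)     # e.g. audio-network features
    caps = torch.randint(0, 10000, (2, 12))
    print(dec(enc(vis, aud), caps).shape)  # torch.Size([2, 12, 10000])

At inference time the loop would feed back the argmax of each step's logits instead of the gold tokens; the teacher-forced form above is the standard training setup for such encoder-decoder captioners.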
Pages: 839 - 844
Page count: 6