Unifying the Video and Question Attentions for Open-Ended Video Question Answering

Cited by: 47

Authors:

Xue, Hongyang [1]

Zhao, Zhou [2]

Cai, Deng [1]

Affiliations:

[1] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Zhejiang, Peoples R China

[2] Zhejiang Univ, Coll Comp Sci, Hangzhou 310027, Zhejiang, Peoples R China

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2017 / Vol. 26 / No. 12

Keywords:

Video question answering; attention model; scene understanding;

DOI:

10.1109/TIP.2017.2746267

Chinese Library Classification:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video question answering is an important task toward scene understanding and visual data retrieval. However, current visual question answering work mainly focuses on single static images, which differ from the dynamic, sequential visual data of the real world, so these approaches cannot utilize the temporal information in videos. In this paper, we introduce the task of free-form open-ended video question answering. Open-ended answers enable wider applications than the common multiple-choice tasks in Visual-QA. We first propose a data set for open-ended Video-QA, built with automatic question generation approaches. Then, we propose our sequential video attention and temporal question attention models. These two models apply the attention mechanism to videos and questions while preserving the sequential and temporal structures of the guides. The two models are integrated into a unified attention model. After the video and the question are encoded, a decoder generates the answers word by word. Finally, we evaluate our models on the proposed data set. The experimental results demonstrate the effectiveness of our proposed model.

Pages: 5656-5666

Number of pages: 11
