Harnessing Representative Spatial-Temporal Information for Video Question Answering

Cited by: 0

Authors:

Wang, Yuanyuan [1]

Liu, Meng [2]

Song, Xuemeng [1]

Nie, Liqiang [3]

Affiliations:

[1] Shandong Univ, Qingdao, Peoples R China

[2] Shandong Jianzhu Univ, Jinan, Peoples R China

[3] Harbin Inst Technol, Shenzhen, Peoples R China

Source:

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS | 2024, Vol. 20, No. 10

Funding:

National Natural Science Foundation of China;

Keywords:

Video question answering; uncertainty estimation; expectation-maximization attention;

DOI:

10.1145/3675399

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Video question answering, aiming to answer a natural language question related to the given video, has become prevalent in the past few years. Although remarkable improvements have been obtained, it is still exposed to the challenge of insufficient comprehension of video content. To this end, we propose a spatial-temporal representative visual exploitation network for video question answering, which enhances the understanding of the video by merely summarizing representative visual information. In order to explore representative object information, we advance adaptive attention based on uncertainty estimation. At the same time, to capture representative frame-level and clip-level visual information, we structure a much more compact set of representations iteratively in an expectation-maximization manner to deprecate noisy information. Both the quantitative and qualitative results on NExT-QA, TGIF-QA, MSRVTT-QA, and MSVD-QA datasets demonstrate the superiority of our model over several state-of-the-art approaches.
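The abstract's phrase "a much more compact set of representations iteratively in an expectation-maximization manner" can be illustrated with a generic EM-style attention loop. The sketch below is not the authors' implementation: the function name `em_attention`, the number of bases, the iteration count, and the random initialisation are all assumptions made for illustration. It alternates a soft assignment of frame features to a few bases (E-step) with a weighted re-estimation of those bases (M-step), then rebuilds each feature from the compact bases.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _softmax(scores):
    # Subtract the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def _l2_normalise(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + 1e-8) for v in vec]


def em_attention(features, num_bases=4, num_iters=3, seed=0):
    """Summarise N feature vectors with K bases (hypothetical helper).

    features: list of N vectors (lists of floats), each of dimension D.
    num_iters must be at least 1.
    Returns (bases, responsibilities, reconstructed).
    """
    rng = random.Random(seed)
    dim = len(features[0])
    # Assumed initialisation: random Gaussian bases, L2-normalised.
    bases = [_l2_normalise([rng.gauss(0.0, 1.0) for _ in range(dim)])
             for _ in range(num_bases)]
    resp = []
    for _ in range(num_iters):
        # E-step: soft-assign every feature to the bases.
        resp = [_softmax([_dot(x, mu) for mu in bases]) for x in features]
        # M-step: each basis is the responsibility-weighted mean of the features.
        new_bases = []
        for k in range(num_bases):
            weight = sum(r[k] for r in resp) + 1e-8
            mean = [sum(r[k] * x[d] for r, x in zip(resp, features)) / weight
                    for d in range(dim)]
            new_bases.append(_l2_normalise(mean))
        bases = new_bases
    # Rebuild each feature as a mixture of the compact bases.
    reconstructed = [[sum(r[k] * bases[k][d] for k in range(num_bases))
                      for d in range(dim)] for r in resp]
    return bases, resp, reconstructed
```

Called on, say, six 3-dimensional frame features with `num_bases=2`, the function returns two unit-norm bases, one responsibility row per frame that sums to 1, and six reconstructed features. How the paper applies such a step at the frame and clip levels is not specified in this record.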

Pages: 20
