Harnessing Representative Spatial-Temporal Information for Video Question Answering

Cited: 0
Authors
Wang, Yuanyuan [1 ]
Liu, Meng [2 ]
Song, Xuemeng [1 ]
Nie, Liqiang [3 ]
Affiliations
[1] Shandong Univ, Qingdao, Peoples R China
[2] Shandong Jianzhu Univ, Jinan, Peoples R China
[3] Harbin Inst Technol, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video question answering; uncertainty estimation; expectation-maximization attention
DOI
10.1145/3675399
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Video question answering, which aims to answer a natural language question about a given video, has become prevalent in the past few years. Although remarkable improvements have been achieved, the task still suffers from insufficient comprehension of video content. To this end, we propose a spatial-temporal representative visual exploitation network for video question answering, which enhances video understanding by summarizing only representative visual information. To explore representative object information, we introduce an adaptive attention mechanism based on uncertainty estimation. At the same time, to capture representative frame-level and clip-level visual information, we iteratively construct a much more compact set of representations in an expectation-maximization manner to discard noisy information. Both the quantitative and qualitative results on the NExT-QA, TGIF-QA, MSRVTT-QA, and MSVD-QA datasets demonstrate the superiority of our model over several state-of-the-art approaches.
Pages: 20
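The abstract describes two mechanisms: an uncertainty-based adaptive attention over object information, and an expectation-maximization (EM) style iteration that condenses frame- and clip-level features into a compact set of representatives. As a rough illustration of the latter idea only, below is a minimal NumPy sketch of an EM-attention-style summarization step; the function name, basis count, iteration count, and temperature are illustrative assumptions and do not reproduce the authors' implementation.

```python
import numpy as np

def em_attention_summarize(features, num_bases=8, num_iters=3, temperature=1.0, seed=0):
    """Iteratively compress frame/clip features into a compact set of bases.

    E-step: soft-assign each feature to the current bases (attention weights).
    M-step: re-estimate each basis as the responsibility-weighted feature mean.
    All names and hyperparameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape

    # Initialize the bases randomly and L2-normalize them.
    bases = rng.standard_normal((num_bases, d))
    bases /= np.linalg.norm(bases, axis=1, keepdims=True) + 1e-8

    for _ in range(num_iters):
        # E-step: responsibilities of each feature with respect to each basis.
        logits = temperature * features @ bases.T                # (n, num_bases)
        resp = np.exp(logits - logits.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update each basis as the weighted average of the features.
        bases = resp.T @ features                                 # (num_bases, d)
        bases /= resp.sum(axis=0, keepdims=True).T + 1e-8
        bases /= np.linalg.norm(bases, axis=1, keepdims=True) + 1e-8

    return bases, resp

# Toy usage: summarize 64 noisy frame-level features of dimension 512.
frames = np.random.randn(64, 512).astype(np.float32)
compact, weights = em_attention_summarize(frames, num_bases=8, num_iters=3)
print(compact.shape, weights.shape)   # (8, 512) (64, 8)
```

In a scheme of this kind, a few EM iterations leave the bases acting as a small, denoised summary of the original frame or clip features, which is the intuition behind using a compact representative set in place of all raw visual tokens.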
Related Papers
50 records in total
  • [31] Large Language Models are Temporal and Causal Reasoners for Video Question Answering
    Ko, Dohwan
    Lee, Ji Soo
    Kang, Wooyoung
    Roh, Byungseok
    Kim, Hyunwoo J.
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 4300 - 4316
  • [32] Application of temporal information extraction techniques to question answering systems
    Teresa Vicente-Diez, Maria
    Martinez, Paloma
    Martinez-Gonzalez, Angel
    Luis Martinez-Fernandez, Jose
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2009, (42): : 25 - 30
  • [33] Spatio-Temporal Graph Convolution Transformer for Video Question Answering
    Tang, Jiahao
    Hu, Jianguo
    Huang, Wenjun
    Shen, Shengzhi
    Pan, Jiakai
    Wang, Deming
    Ding, Yanyu
    IEEE ACCESS, 2024, 12 : 131664 - 131680
  • [34] Dynamic Spatio-Temporal Modular Network for Video Question Answering
    Qian, Zi
    Wang, Xin
    Duan, Xuguang
    Chen, Hong
    Zhu, Wenwu
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4466 - 4477
  • [35] Affective question answering on video
    Ruwa, Nelson
    Mao, Qirong
    Wang, Liangjun
    Gou, Jianping
    NEUROCOMPUTING, 2019, 363 : 125 - 139
  • [36] Video Captioning Based on the Spatial-Temporal Saliency Tracing
    Zhou, Yuanen
    Hu, Zhenzhen
    Liu, Xueliang
    Wang, Meng
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT I, 2018, 11164 : 59 - 70
  • [37] Deep Video Harmonization by Improving Spatial-temporal Consistency
    Chen, Xiuwen
    Fang, Li
    Ye, Long
    Zhang, Qin
    Machine Intelligence Research, 2024, 21 : 46 - 54
  • [38] Spatial-Temporal Separable Attention for Video Action Recognition
    Guo, Xi
    Hu, Yikun
    Chen, Fang
    Jin, Yuhui
    Qiao, Jian
    Huang, Jian
    Yang, Qin
    2022 INTERNATIONAL CONFERENCE ON FRONTIERS OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING, FAIML, 2022, : 224 - 228
  • [39] Spatial-Temporal Transformer for Video Snapshot Compressive Imaging
    Wang, Lishun
    Cao, Miao
    Zhong, Yong
    Yuan, Xin
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (07) : 9072 - 9089
  • [40] ShiftFormer: Spatial-Temporal Shift Operation in Video Transformer
    Yang, Beiying
    Zhu, Guibo
    Ge, Guojing
    Luo, Jinzhao
    Wang, Jinqiao
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1895 - 1900