Harnessing Representative Spatial-Temporal Information for Video Question Answering

Cited by: 0

Authors:

Wang, Yuanyuan [1]

Liu, Meng [2]

Song, Xuemeng [1]

Nie, Liqiang [3]

Affiliations:

[1] Shandong Univ, Qingdao, Peoples R China

[2] Shandong Jianzhu Univ, Jinan, Peoples R China

[3] Harbin Inst Technol, Shenzhen, Peoples R China

Source:

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS | 2024 / Vol. 20 / No. 10

Funding:

National Natural Science Foundation of China;

Keywords:

Video question answering; uncertainty estimation; expectation-maximization attention;

DOI:

10.1145/3675399

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Video question answering, aiming to answer a natural language question related to the given video, has become prevalent in the past few years. Although remarkable improvements have been obtained, it is still exposed to the challenge of insufficient comprehension of video content. To this end, we propose a spatial-temporal representative visual exploitation network for video question answering, which enhances the understanding of the video by merely summarizing representative visual information. In order to explore representative object information, we advance adaptive attention based on uncertainty estimation. At the same time, to capture representative frame-level and clip-level visual information, we structure a much more compact set of representations iteratively in an expectation-maximization manner to deprecate noisy information. Both the quantitative and qualitative results on NExT-QA, TGIF-QA, MSRVTT-QA, and MSVD-QA datasets demonstrate the superiority of our model over several state-of-the-art approaches.

Pages: 20

