Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments

Cited by: 7

Authors:

Gao, Difei [1,2]

Wang, Ruiping [1,2,3]

Bai, Ziyi [1,2]

Chen, Xilin [1,2]

Affiliations:

[1] Chinese Acad Sci, Inst Comp Technol, CAS, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China

[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China

[3] Beijing Acad Artificial Intelligence, Beijing 100084, Peoples R China

Source:

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021

Funding:

National Key Research and Development Program of China

Keywords:

DOI:

10.1109/ICCV48922.2021.00170

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Visual understanding goes well beyond the study of images or videos on the web. To accomplish complex tasks in volatile situations, humans can deeply understand the environment, quickly perceive events happening around them, and continuously track objects' state changes, all of which remain challenging for current AI systems. To equip AI systems with the ability to understand dynamic ENVironments, we build a video Question Answering dataset named Env-QA. Env-QA contains 23K egocentric videos, each composed of a series of events about exploring and interacting in the environment. It also provides 85K questions that evaluate the ability to understand the composition, layout, and state changes of the environment presented by the events in the videos. Moreover, we propose a video QA model, the Temporal Segmentation and Event Attention network (TSEA), which introduces an event-level video representation and corresponding attention mechanisms to better extract environment information and answer questions. Comprehensive experiments demonstrate the effectiveness of our framework and show the formidable challenges of Env-QA in areas such as long-term state tracking, multi-event temporal reasoning, and event counting.


Pages: 1655-1665

Page count: 11

