Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments

Cited by: 7
Authors
Gao, Difei [1, 2]
Wang, Ruiping [1, 2, 3]
Bai, Ziyi [1, 2]
Chen, Xilin [1, 2]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, CAS, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Beijing Acad Artificial Intelligence, Beijing 100084, Peoples R China
Funding
National Key Research and Development Program of China
DOI
10.1109/ICCV48922.2021.00170
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Visual understanding goes well beyond the study of images or videos on the web. To achieve complex tasks in volatile situations, humans can deeply understand the environment, quickly perceive events happening around them, and continuously track objects' state changes, all of which remain challenging for current AI systems. To equip AI systems with the ability to understand dynamic ENVironments, we build a video Question Answering dataset named Env-QA. Env-QA contains 23K egocentric videos, where each video is composed of a series of events about exploring and interacting in the environment. It also provides 85K questions to evaluate the ability to understand the composition, layout, and state changes of the environment presented by the events in the videos. Moreover, we propose a video QA model, the Temporal Segmentation and Event Attention network (TSEA), which introduces an event-level video representation and corresponding attention mechanisms to better extract environment information and answer questions. Comprehensive experiments demonstrate the effectiveness of our framework and show the formidable challenges Env-QA poses in long-term state tracking, multi-event temporal reasoning, and event counting, among others.
Pages: 1655-1665
Number of pages: 11