Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments

Cited by: 7
Authors
Gao, Difei [1, 2]
Wang, Ruiping [1, 2, 3]
Bai, Ziyi [1, 2]
Chen, Xilin [1, 2]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, CAS, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Beijing Acad Artificial Intelligence, Beijing 100084, Peoples R China
Funding
National Key Research and Development Program of China
DOI
10.1109/ICCV48922.2021.00170
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual understanding goes well beyond the study of images or videos on the web. To achieve complex tasks in volatile situations, humans can deeply understand the environment, quickly perceive events happening around them, and continuously track objects' state changes, all of which remain challenging for current AI systems. To equip AI systems with the ability to understand dynamic ENVironments, we build a video Question Answering dataset named Env-QA. Env-QA contains 23K egocentric videos, where each video is composed of a series of events about exploring and interacting in the environment. It also provides 85K questions to evaluate the ability to understand the composition, layout, and state changes of the environment presented by the events in the videos. Moreover, we propose a video QA model, the Temporal Segmentation and Event Attention network (TSEA), which introduces an event-level video representation and corresponding attention mechanisms to better extract environment information and answer questions. Comprehensive experiments demonstrate the effectiveness of our framework and show the formidable challenges Env-QA poses in long-term state tracking, multi-event temporal reasoning, and event counting.
Pages: 1655-1665
Number of pages: 11
Related papers
35 in total
  • [31] WULAI-QA: Web Understanding and Learning with AI towards Document-based Question Answering against COVID-19
    Zhang, Yuan
    Zhang, Xiaoqing
    Hu, Yichuan
    Wang, Guanchun
    Yan, Rui
    WSDM '21: PROCEEDINGS OF THE 14TH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, 2021, : 898 - 901
  • [32] Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models Through Question Answering from Text to Video
    Yang, Zhengbang
    Xia, Haotian
    Li, Jingxi
    Chen, Zezhi
    Zhu, Zhuangdi
    Shen, Weining
    ELECTRONICS, 2025, 14 (03):
  • [33] Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering
    Lyu, Chenyang
    Ji, Tianbo
    Graham, Yvette
    Foster, Jennifer
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-SRW 2023, VOL 4, 2023, : 50 - 56
  • [34] A comprehensive understanding of adaptive thermal comfort in dynamic environments-An interaction matrix-based path analysis modeling framework
    Ming, Ru
    Li, Baizhan
    Du, Chenqiu
    Yu, Wei
    Liu, Hong
    Kosonen, Risto
    Yao, Runming
    ENERGY AND BUILDINGS, 2023, 284
  • [35] ViOCRVQA: novel benchmark dataset and VisionReader for visual question answering by understanding Vietnamese text in images
    Huy Quang Pham
    Thang Kien-Bao Nguyen
    Quan Van Nguyen
    Dan Quang Tran
    Nghia Hieu Nguyen
    Kiet Van Nguyen
    Ngan Luu-Thuy Nguyen
    Multimedia Systems, 2025, 31 (2)