Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments

Cited by: 7
Authors
Gao, Difei [1, 2]
Wang, Ruiping [1, 2, 3]
Bai, Ziyi [1, 2]
Chen, Xilin [1, 2]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, CAS, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Beijing Acad Artificial Intelligence, Beijing 100084, Peoples R China
Funding
National Key Research and Development Program of China
DOI
10.1109/ICCV48922.2021.00170
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Visual understanding goes well beyond the study of images or videos on the web. To achieve complex tasks in volatile situations, humans can deeply understand the environment, quickly perceive events happening around them, and continuously track objects' state changes, all of which remain challenging for current AI systems. To equip AI systems with the ability to understand dynamic ENVironments, we build a video Question Answering dataset named Env-QA. Env-QA contains 23K egocentric videos, where each video is composed of a series of events about exploring and interacting in the environment. It also provides 85K questions to evaluate the ability to understand the composition, layout, and state changes of the environment presented by the events in videos. Moreover, we propose a video QA model, the Temporal Segmentation and Event Attention network (TSEA), which introduces an event-level video representation and corresponding attention mechanisms to better extract environment information and answer questions. Comprehensive experiments demonstrate the effectiveness of our framework and show the formidable challenges that Env-QA poses in long-term state tracking, multi-event temporal reasoning, and event counting.
Pages: 1655-1665
Page count: 11