ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos

Cited by: 4
Authors
Yu, Zhou [1 ]
Zheng, Lixiang [1 ]
Zhao, Zhou [2 ]
Wu, Fei [2 ]
Fan, Jianping [1 ,3 ]
Ren, Kui [4 ]
Yu, Jun [1 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci, Hangzhou, Peoples R China
[2] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Peoples R China
[3] Lenovo Res, AI Lab, Beijing, Peoples R China
[4] Zhejiang Univ, Sch Cyber Sci & Technol, Hangzhou, Peoples R China
DOI
10.1109/CVPR52729.2023.02221
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Building benchmarks to systematically analyze different capabilities of video question answering (VideoQA) models is challenging yet crucial. Existing benchmarks often use non-compositional simple questions and suffer from language biases, making it difficult to diagnose model weaknesses incisively. A recent benchmark, AGQA [8], poses a promising paradigm that generates QA pairs automatically from pre-annotated scene graphs, enabling it to measure diverse reasoning abilities with granular control. However, its questions are limited in reasoning about the fine-grained semantics in videos, as such information is absent from its scene graphs. To this end, we present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over the challenging untrimmed videos from ActivityNet [4]. Similar to AGQA, the QA pairs in ANetQA are automatically generated from annotated video scene graphs. The fine-grained properties of ANetQA are reflected in the following: (i) untrimmed videos with fine-grained semantics; (ii) spatio-temporal scene graphs with fine-grained taxonomies; and (iii) diverse questions generated from fine-grained templates. ANetQA contains 1.4 billion unbalanced and 13.4 million balanced QA pairs, an order of magnitude more than AGQA with a similar number of videos. Comprehensive experiments are performed on state-of-the-art methods. The best model achieves 44.5% accuracy, while human performance reaches 84.5%, leaving sufficient room for improvement.
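The abstract describes generating QA pairs automatically from annotated scene graphs via templates. The sketch below illustrates that general paradigm on a toy example; it is not the ANetQA generator. The `Relation` record, the time-span representation, the three templates, and the function names `overlaps` and `generate_qa` are all hypothetical, chosen only to show how a simple single-relation question differs from a compositional one that conditions on a second, temporally overlapping relation.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Relation:
    """Hypothetical scene-graph edge: subject --predicate--> obj over [start, end] seconds."""
    subject: str
    predicate: str
    obj: str
    start: float
    end: float


def overlaps(a: Relation, b: Relation) -> bool:
    """True if the time spans of the two relations intersect."""
    return a.start < b.end and b.start < a.end


def generate_qa(graph: list[Relation]) -> list[tuple[str, str]]:
    """Instantiate hypothetical templates against every matching part of the graph."""
    qa: list[tuple[str, str]] = []
    # Simple (non-compositional) templates: each question reads one relation.
    for r in graph:
        qa.append((f"What is the {r.subject} {r.predicate}?", r.obj))
        qa.append((f"Is the {r.subject} {r.predicate} the {r.obj}?", "yes"))
    # Compositional template: query relation `a`, conditioned on a second
    # relation `b` of the same subject whose time span overlaps `a`.
    for a, b in permutations(graph, 2):
        if a.subject == b.subject and overlaps(a, b):
            question = f"What is the {a.subject} {a.predicate} while {b.predicate} the {b.obj}?"
            qa.append((question, a.obj))
    return qa


# Toy, made-up scene graph for a single video.
graph = [
    Relation("person", "holding", "guitar", 0.0, 12.5),
    Relation("person", "sitting on", "chair", 3.0, 20.0),
    Relation("person", "wearing", "red hat", 0.0, 20.0),
]

if __name__ == "__main__":
    for q, a in generate_qa(graph):
        print(q, "->", a)
```

Note that every yes/no question in this toy generator answers "yes". That skew shows why raw generated pairs (the "unbalanced" set) need a separate balancing step before they can serve as an unbiased benchmark. How ANetQA performs its balancing is not described here.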
Pages: 23191-23200
Number of pages: 10