NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario

Cited by: 0
Authors
Qian, Tianwen [1 ]
Chen, Jingjing [2 ]
Zhuo, Linhai [2 ]
Jiao, Yang [2 ]
Jiang, Yu-Gang [2 ]
Affiliations
[1] Fudan Univ, Acad Engn & Technol, Shanghai, Peoples R China
[2] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Sch Comp Sci, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Secondly, the data are multi-frame due to continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics show that NuScenes-QA is a balanced, large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Code and dataset are available at https://github.com/qiantianwen/NuScenes-QA.
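The generation pipeline described in the abstract (scene graphs from 3D detection labels, filled into manually designed question templates) can be illustrated with a minimal sketch. This is a hypothetical toy example, not the authors' released code: the `scene` dictionaries, slot names, and template are invented for illustration, loosely mimicking nuScenes-style category/status annotations.

```python
# Toy "scene graph": objects with category and status attributes,
# loosely mimicking nuScenes-style 3D detection annotations.
scene = [
    {"category": "car", "status": "moving"},
    {"category": "car", "status": "parked"},
    {"category": "pedestrian", "status": "standing"},
]

# Manually designed question templates with typed slots; each template
# pairs a text pattern with a function that computes the ground-truth
# answer directly from the annotations.
templates = [
    ("How many <status> <category>s are there?",
     lambda objs, status, category: sum(
         1 for o in objs
         if o["status"] == status and o["category"] == category)),
]

def generate_qa(scene, templates):
    """Instantiate every template for every (status, category) pair."""
    qa_pairs = []
    statuses = {o["status"] for o in scene}
    categories = {o["category"] for o in scene}
    for text, answer_fn in templates:
        for status in sorted(statuses):
            for category in sorted(categories):
                question = (text.replace("<status>", status)
                                .replace("<category>", category))
                answer = answer_fn(scene, status, category)
                qa_pairs.append((question, str(answer)))
    return qa_pairs

pairs = generate_qa(scene, templates)
# e.g. ("How many moving cars are there?", "1")
```

Because answers are computed from the same annotations that fill the templates, the question-answer pairs are correct by construction, which is what makes this style of programmatic generation scale to hundreds of thousands of pairs.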
Pages: 4542 - 4550
Page count: 9