NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario

Cited by: 0

Authors:

Qian, Tianwen [1]

Chen, Jingjing [2]

Zhuo, Linhai [2]

Jiao, Yang [2]

Jiang, Yu-Gang [2]

Affiliations:

[1] Fudan Univ, Acad Engn & Technol, Shanghai, Peoples R China

[2] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Sch Comp Sci, Shanghai, Peoples R China

Source:

THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 5 | 2024

Funding:

National Natural Science Foundation of China;


DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Secondly, the data are multi-frame due to the continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground and static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics prove that our NuScenes-QA is a balanced large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Code and dataset are available at https://github.com/qiantianwen/NuScenes-QA.
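
The abstract describes building scene graphs from 3D detection annotations and then filling manually designed templates to produce question-answer pairs. The following is a minimal illustrative sketch of that general idea, not the authors' implementation: the `SceneObject` fields, the two template functions, and the example scene are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical scene-graph node derived from a 3D detection annotation."""
    category: str   # e.g. "car"
    status: str     # e.g. "moving" or "parked"
    direction: str  # position relative to the ego vehicle, e.g. "front"


def count_question(objects, category, direction):
    """Fill a hypothetical counting template and compute its answer."""
    n = sum(
        1 for o in objects
        if o.category == category and o.direction == direction
    )
    question = f"How many {category}s are to the {direction} of me?"
    return question, str(n)


def exist_question(objects, status, category):
    """Fill a hypothetical existence template and compute its answer."""
    found = any(
        o.status == status and o.category == category for o in objects
    )
    question = f"Are there any {status} {category}s?"
    return question, "yes" if found else "no"


# A toy scene graph (assumed data, for illustration only).
scene = [
    SceneObject("car", "moving", "front"),
    SceneObject("car", "parked", "front"),
    SceneObject("pedestrian", "moving", "back left"),
]

# Answers come from the annotations, so no manual labelling is needed.
qa_pairs = [
    count_question(scene, "car", "front"),
    exist_question(scene, "moving", "pedestrian"),
    exist_question(scene, "parked", "truck"),
]

for question, answer in qa_pairs:
    print(question, "->", answer)
```

Because each answer is computed directly from the annotated objects, a generator of this kind can scale to many scenes once the templates are written; balancing the answer distribution, which the abstract also mentions, would require additional filtering that this sketch does not attempt.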


Pages: 4542-4550

Page count: 9
