A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

Cited: 8
Authors
Burns, Andrea [1]
Arsan, Deniz [2]
Agrawal, Sanjna [1]
Kumar, Ranjitha [2]
Saenko, Kate [1,3]
Plummer, Bryan A. [1]
Affiliations
[1] Boston Univ, Boston, MA 02215 USA
[2] Univ Illinois, Champaign, IL 61820 USA
[3] MIT IBM Watson AI Lab, Cambridge, MA 02142 USA
Source
COMPUTER VISION, ECCV 2022, PT VIII | 2022 / Vol. 13668
Keywords
Vision-language navigation; Task feasibility; Mobile apps
DOI
10.1007/978-3-031-20074-8_18
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Vision-language navigation (VLN), in which an agent follows a language instruction in a visual environment, has been studied under the premise that the input command is fully feasible in the environment. Yet in practice, a request may not be possible due to language ambiguity or environment changes. To study VLN with unknown command feasibility, we introduce a new dataset, Mobile app Tasks with Iterative Feedback (MoTIF), where the goal is to complete a natural language command in a mobile app. Mobile apps provide a scalable domain to study real downstream uses of VLN methods. Moreover, mobile app commands provide instruction for interactive navigation, as they result in action sequences with state changes via clicking, typing, or swiping. MoTIF is the first dataset to include feasibility annotations, containing both binary feasibility labels and fine-grained labels for why tasks are unsatisfiable. We further collect follow-up questions for ambiguous queries to enable research on task uncertainty resolution. Equipped with our dataset, we propose the new problem of feasibility prediction, in which a natural language instruction and a multimodal app environment are used to predict command feasibility. MoTIF provides a more realistic app dataset, as it contains many diverse environments, high-level goals, and longer action sequences than prior work. We evaluate interactive VLN methods on MoTIF, quantify the generalization ability of current approaches to new app environments, and measure the effect of task feasibility on navigation performance.
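The abstract frames feasibility prediction as a binary decision made from a natural language command together with the app's multimodal state. As a rough, hypothetical illustration only (this is not the authors' MoTIF baseline; all names, dimensions, and the random stand-in embeddings below are assumptions), a minimal PyTorch sketch of such a classifier could look like:

import torch
import torch.nn as nn

class FeasibilityClassifier(nn.Module):
    """Fuses a command embedding with an app-state embedding and
    outputs the probability that the command is feasible."""
    def __init__(self, text_dim: int, app_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + app_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit for binary feasibility
        )

    def forward(self, text_emb: torch.Tensor, app_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([text_emb, app_emb], dim=-1)
        return torch.sigmoid(self.net(fused)).squeeze(-1)

# Toy usage: random tensors stand in for real encoders (e.g., a language
# model for the command and a screen encoder for the app view hierarchy).
model = FeasibilityClassifier(text_dim=64, app_dim=64)
cmd = torch.randn(2, 64)    # batch of 2 command embeddings (illustrative)
state = torch.randn(2, 64)  # batch of 2 app-state embeddings (illustrative)
print(model(cmd, state))    # feasibility probabilities in [0, 1]

In a real setup the stand-in embeddings would come from trained text and screen encoders, and the model would be trained against MoTIF's binary feasibility labels.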
Pages: 312-328
Number of pages: 17
Related Papers
50 items in total
  • [21] Zhao, Yusheng; Chen, Jinyu; Gao, Chen; Wang, Wenguan; Yang, Lirong; Ren, Haibing; Xia, Huaxia; Liu, Si. Target-Driven Structured Transformer Planner for Vision-Language Navigation. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 4194-4203.
  • [22] Lin, Bingqian; Zhu, Yi; Chen, Zicong; Liang, Xiwen; Liu, Jianzhuang; Liang, Xiaodan. ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 15375-15385.
  • [23] Lin, Bingqian; Zhu, Yi; Liang, Xiaodan; Lin, Liang; Liu, Jianzhuang. Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 2, 2023: 1568-1576.
  • [24] Wang, Zihan; Li, Xiangyang; Yang, Jiahao; Liu, Yeqi; Hu, Junjie; Jiang, Ming; Jiang, Shuqiang. Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation. 2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 13753-13762.
  • [25] Gao, Chen; Peng, Xingyu; Yan, Mi; Wang, He; Yang, Lirong; Ren, Haibing; Li, Hongsheng; Liu, Si. Adaptive Zone-aware Hierarchical Planner for Vision-Language Navigation. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023: 14911-14920.
  • [26] Slyman, Eric; Lee, Stefan; Cohen, Scott; Kafle, Kushal. FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication. 2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 13905-13916.
  • [27] Yokoyama, Naoki; Ha, Sehoon; Batra, Dhruv; Wang, Jiuguang; Bucher, Bernadette. VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation. 2024 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA 2024, 2024: 42-48.
  • [28] Luo, Haonan; Guo, Ziyu; Wu, Zhenyu; Teng, Fei; Li, Tianrui. Transformer-based vision-language alignment for robot navigation and question answering. INFORMATION FUSION, 2024, 108.
  • [29] Xiang, Jiannan; Wang, Xin Eric; Wang, William Yang. Learning to Stop: A Simple yet Effective Approach to Urban Vision-Language Navigation. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020: 699-707.
  • [30] Liu, Rui; Wang, Xiaohan; Wang, Wenguan; Yang, Yi. Bird's-Eye-View Scene Graph for Vision-Language Navigation. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 10934-10946.