A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

Cited: 8
Authors
Burns, Andrea [1]
Arsan, Deniz [2]
Agrawal, Sanjna [1]
Kumar, Ranjitha [2]
Saenko, Kate [1,3]
Plummer, Bryan A. [1]
Affiliations
[1] Boston Univ, Boston, MA 02215 USA
[2] Univ Illinois, Champaign, IL 61820 USA
[3] MIT IBM Watson AI Lab, Cambridge, MA 02142 USA
Source
COMPUTER VISION, ECCV 2022, PT VIII | 2022 / Vol. 13668
Keywords
Vision-language navigation; Task feasibility; Mobile apps
DOI
10.1007/978-3-031-20074-8_18
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Vision-language navigation (VLN), in which an agent follows a language instruction in a visual environment, has been studied under the premise that the input command is fully feasible in the environment. Yet in practice, a request may not be possible due to language ambiguity or environment changes. To study VLN with unknown command feasibility, we introduce a new dataset, Mobile app Tasks with Iterative Feedback (MoTIF), where the goal is to complete a natural language command in a mobile app. Mobile apps provide a scalable domain to study real downstream uses of VLN methods. Moreover, mobile app commands provide instruction for interactive navigation, as they result in action sequences with state changes via clicking, typing, or swiping. MoTIF is the first dataset to include feasibility annotations, containing both binary feasibility labels and fine-grained labels for why tasks are unsatisfiable. We further collect follow-up questions for ambiguous queries to enable research on task uncertainty resolution. Equipped with our dataset, we propose the new problem of feasibility prediction, in which a natural language instruction and a multimodal app environment are used to predict command feasibility. MoTIF provides a more realistic app dataset, as it contains many diverse environments, high-level goals, and longer action sequences than prior work. We evaluate interactive VLN methods on MoTIF, quantify the generalization ability of current approaches to new app environments, and measure the effect of task feasibility on navigation performance.
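The abstract frames feasibility prediction as a binary decision made from a natural language command together with the app's multimodal state. As a rough, hypothetical illustration only (this is not the authors' MoTIF baseline; all names, dimensions, and the random stand-in embeddings below are assumptions), a minimal PyTorch sketch of such a classifier could look like:

import torch
import torch.nn as nn

class FeasibilityClassifier(nn.Module):
    """Fuses a command embedding with an app-state embedding and
    outputs the probability that the command is feasible."""
    def __init__(self, text_dim: int, app_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + app_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit for binary feasibility
        )

    def forward(self, text_emb: torch.Tensor, app_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([text_emb, app_emb], dim=-1)
        return torch.sigmoid(self.net(fused)).squeeze(-1)

# Toy usage: random tensors stand in for real encoders (e.g., a language
# model for the command and a screen encoder for the app view hierarchy).
model = FeasibilityClassifier(text_dim=64, app_dim=64)
cmd = torch.randn(2, 64)    # batch of 2 command embeddings (illustrative)
state = torch.randn(2, 64)  # batch of 2 app-state embeddings (illustrative)
print(model(cmd, state))    # feasibility probabilities in [0, 1]

In a real setup the stand-in embeddings would come from trained text and screen encoders, and the model would be trained against MoTIF's binary feasibility labels.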
Pages: 312-328
Number of pages: 17
Related Papers
50 items in total
  • [21] Zhao, Yusheng; Chen, Jinyu; Gao, Chen; Wang, Wenguan; Yang, Lirong; Ren, Haibing; Xia, Huaxia; Liu, Si. Target-Driven Structured Transformer Planner for Vision-Language Navigation. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 4194-4203.
  • [22] Lin, Bingqian; Zhu, Yi; Chen, Zicong; Liang, Xiwen; Liu, Jianzhuang; Liang, Xiaodan. ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 15375-15385.
  • [23] Lin, Bingqian; Zhu, Yi; Liang, Xiaodan; Lin, Liang; Liu, Jianzhuang. Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 2, 2023: 1568-1576.
  • [24] Wang, Zihan; Li, Xiangyang; Yang, Jiahao; Liu, Yeqi; Hu, Junjie; Jiang, Ming; Jiang, Shuqiang. Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation. 2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 13753-13762.
  • [25] Gao, Chen; Peng, Xingyu; Yan, Mi; Wang, He; Yang, Lirong; Ren, Haibing; Li, Hongsheng; Liu, Si. Adaptive Zone-aware Hierarchical Planner for Vision-Language Navigation. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023: 14911-14920.
  • [26] Slyman, Eric; Lee, Stefan; Cohen, Scott; Kafle, Kushal. FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication. 2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 13905-13916.
  • [27] Yokoyama, Naoki; Ha, Sehoon; Batra, Dhruv; Wang, Jiuguang; Bucher, Bernadette. VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation. 2024 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA 2024, 2024: 42-48.
  • [28] Luo, Haonan; Guo, Ziyu; Wu, Zhenyu; Teng, Fei; Li, Tianrui. Transformer-based vision-language alignment for robot navigation and question answering. INFORMATION FUSION, 2024, 108.
  • [29] Xiang, Jiannan; Wang, Xin Eric; Wang, William Yang. Learning to Stop: A Simple yet Effective Approach to Urban Vision-Language Navigation. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020: 699-707.
  • [30] Liu, Rui; Wang, Xiaohan; Wang, Wenguan; Yang, Yi. Bird's-Eye-View Scene Graph for Vision-Language Navigation. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 10934-10946.