A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

Cited by: 8
Authors
Burns, Andrea [1 ]
Arsan, Deniz [2 ]
Agrawal, Sanjna [1 ]
Kumar, Ranjitha [2 ]
Saenko, Kate [1 ,3 ]
Plummer, Bryan A. [1 ]
Affiliations
[1] Boston Univ, Boston, MA 02215 USA
[2] Univ Illinois, Champaign, IL 61820 USA
[3] MIT IBM Watson AI Lab, Cambridge, MA 02142 USA
Source
Computer Vision - ECCV 2022, Lecture Notes in Computer Science, Springer
Keywords
Vision-language navigation; Task feasibility; Mobile apps
DOI
10.1007/978-3-031-20074-8_18
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Vision-language navigation (VLN), in which an agent follows a natural language instruction in a visual environment, has been studied under the premise that the input command is fully feasible in the environment. Yet in practice, a request may not be possible due to language ambiguity or environment changes. To study VLN with unknown command feasibility, we introduce a new dataset, Mobile app Tasks with Iterative Feedback (MoTIF), where the goal is to complete a natural language command in a mobile app. Mobile apps provide a scalable domain to study real downstream uses of VLN methods. Moreover, mobile app commands provide instruction for interactive navigation, as they result in action sequences with state changes via clicking, typing, or swiping. MoTIF is the first to include feasibility annotations, containing both binary feasibility labels and fine-grained labels for why tasks are unsatisfiable. We further collect follow-up questions for ambiguous queries to enable research on task uncertainty resolution. Equipped with our dataset, we propose the new problem of feasibility prediction, in which a natural language instruction and multimodal app environment are used to predict command feasibility. MoTIF provides a more realistic app dataset as it contains many diverse environments, high-level goals, and longer action sequences than prior work. We evaluate interactive VLN methods using MoTIF, quantify the generalization ability of current approaches to new app environments, and measure the effect of task feasibility on navigation performance.
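To make the feasibility-prediction problem concrete, the following is a minimal, illustrative sketch: given a command and the text visible on the current app screen, predict whether the command is feasible. The toy examples, the text-only screen representation, and the TF-IDF plus logistic-regression classifier are assumptions made here for illustration; they are not the dataset's baselines or the authors' method.

    # Illustrative feasibility prediction: command + screen text -> feasible (1) / infeasible (0).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy training examples (hypothetical): (command, screen text, feasibility label).
    train = [
        ("add a pair of running shoes to my cart", "search cart shoes add to cart", 1),
        ("turn on dark mode", "recipes favorites shopping list about", 0),
        ("log out of my account", "profile settings log out help", 1),
        ("book a table for two tonight", "alarm timer stopwatch world clock", 0),
    ]

    # Concatenate command and screen text into a single input string per example.
    texts = [f"{cmd} [SEP] {screen}" for cmd, screen, _ in train]
    labels = [y for _, _, y in train]

    # Bag-of-words features with a linear classifier over the joint text.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)

    # Predict feasibility for a new command in a given screen context.
    query = "enable dark mode [SEP] recipes favorites shopping list about"
    print("feasible" if model.predict([query])[0] == 1 else "infeasible")

In the dataset itself the app environment is multimodal (screenshots plus view hierarchies), so a realistic model would combine visual and structural features rather than screen text alone.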
Pages: 312-328
Number of pages: 17