A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

Cited by: 8
Authors
Burns, Andrea [1]
Arsan, Deniz [2]
Agrawal, Sanjna [1]
Kumar, Ranjitha [2]
Saenko, Kate [1,3]
Plummer, Bryan A. [1]
Affiliations
[1] Boston Univ, Boston, MA 02215 USA
[2] Univ Illinois, Champaign, IL 61820 USA
[3] MIT IBM Watson AI Lab, Cambridge, MA 02142 USA
Source
Computer Vision - ECCV 2022
Keywords
Vision-language navigation; Task feasibility; Mobile apps
DOI
10.1007/978-3-031-20074-8_18
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Vision-language navigation (VLN), in which an agent follows a language instruction in a visual environment, has been studied under the premise that the input command is fully feasible in the environment. Yet in practice, a request may not be possible due to language ambiguity or environment changes. To study VLN with unknown command feasibility, we introduce a new dataset, Mobile app Tasks with Iterative Feedback (MoTIF), where the goal is to complete a natural language command in a mobile app. Mobile apps provide a scalable domain in which to study real downstream uses of VLN methods. Moreover, mobile app commands provide instruction for interactive navigation, as they result in action sequences with state changes via clicking, typing, or swiping. MoTIF is the first dataset to include feasibility annotations, containing both binary feasibility labels and fine-grained labels for why tasks are unsatisfiable. We further collect follow-up questions for ambiguous queries to enable research on task uncertainty resolution. Equipped with our dataset, we propose the new problem of feasibility prediction, in which a natural language instruction and a multimodal app environment are used to predict command feasibility. MoTIF provides a more realistic app dataset, as it contains many diverse environments, high-level goals, and longer action sequences than prior work. We evaluate interactive VLN methods on MoTIF, quantify the generalization ability of current approaches to new app environments, and measure the effect of task feasibility on navigation performance.
Pages: 312-328
Page count: 17
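
The feasibility prediction task described in the abstract reduces, in its simplest reading, to binary classification over a fused representation of the command text and the app screen. The sketch below illustrates only that formulation; it is not the authors' model, and every name and dimension in it (FeasibilityClassifier, text_dim, screen_dim, and the assumption of precomputed screen features) is an illustrative placeholder.

import torch
import torch.nn as nn

class FeasibilityClassifier(nn.Module):
    """Illustrative sketch of feasibility prediction as binary
    classification; not the MoTIF authors' architecture."""

    def __init__(self, vocab_size=10000, text_dim=128, screen_dim=64):
        super().__init__()
        # Mean-pool command token embeddings into a single text vector.
        self.embed = nn.EmbeddingBag(vocab_size, text_dim)
        # Project precomputed screen features (assumed given, e.g. from a
        # view-hierarchy or screenshot encoder) into the text space.
        self.screen_proj = nn.Linear(screen_dim, text_dim)
        # Fuse the two modalities and produce one feasibility logit.
        self.head = nn.Sequential(
            nn.Linear(2 * text_dim, text_dim),
            nn.ReLU(),
            nn.Linear(text_dim, 1),
        )

    def forward(self, command_tokens, screen_feats):
        text = self.embed(command_tokens)        # (batch, text_dim)
        screen = self.screen_proj(screen_feats)  # (batch, text_dim)
        fused = torch.cat([text, screen], dim=-1)
        return self.head(fused).squeeze(-1)      # one logit per example

# Toy usage with random inputs: probability that each command is feasible.
model = FeasibilityClassifier()
tokens = torch.randint(0, 10000, (2, 12))   # two commands, 12 token ids each
screens = torch.randn(2, 64)                # two dummy screen feature vectors
probs = torch.sigmoid(model(tokens, screens))

In practice the screen features would come from the app's view hierarchy or rendered screenshots and the text encoder would be a pretrained language model rather than a mean-pooled embedding; the point here is only the shape of the problem: two modalities in, one feasibility score out.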