A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

Cited by: 8
Authors
Burns, Andrea [1 ]
Arsan, Deniz [2 ]
Agrawal, Sanjna [1 ]
Kumar, Ranjitha [2 ]
Saenko, Kate [1 ,3 ]
Plummer, Bryan A. [1 ]
Affiliations
[1] Boston Univ, Boston, MA 02215 USA
[2] Univ Illinois, Champaign, IL 61820 USA
[3] MIT IBM Watson AI Lab, Cambridge, MA 02142 USA
Source
Computer Vision - ECCV 2022, Lecture Notes in Computer Science, Springer
Keywords
Vision-language navigation; Task feasibility; Mobile apps
DOI
10.1007/978-3-031-20074-8_18
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Vision-language navigation (VLN), in which an agent follows a natural language instruction in a visual environment, has been studied under the premise that the input command is fully feasible in the environment. Yet in practice, a request may not be possible due to language ambiguity or environment changes. To study VLN with unknown command feasibility, we introduce a new dataset, Mobile app Tasks with Iterative Feedback (MoTIF), where the goal is to complete a natural language command in a mobile app. Mobile apps provide a scalable domain to study real downstream uses of VLN methods. Moreover, mobile app commands provide instruction for interactive navigation, as they result in action sequences with state changes via clicking, typing, or swiping. MoTIF is the first to include feasibility annotations, containing both binary feasibility labels and fine-grained labels for why tasks are unsatisfiable. We further collect follow-up questions for ambiguous queries to enable research on task uncertainty resolution. Equipped with our dataset, we propose the new problem of feasibility prediction, in which a natural language instruction and multimodal app environment are used to predict command feasibility. MoTIF provides a more realistic app dataset as it contains many diverse environments, high-level goals, and longer action sequences than prior work. We evaluate interactive VLN methods using MoTIF, quantify the generalization ability of current approaches to new app environments, and measure the effect of task feasibility on navigation performance.
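To make the feasibility-prediction problem concrete, the following is a minimal, illustrative sketch: given a command and the text visible on the current app screen, predict whether the command is feasible. The toy examples, the text-only screen representation, and the TF-IDF plus logistic-regression classifier are assumptions made here for illustration; they are not the dataset's baselines or the authors' method.

    # Illustrative feasibility prediction: command + screen text -> feasible (1) / infeasible (0).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy training examples (hypothetical): (command, screen text, feasibility label).
    train = [
        ("add a pair of running shoes to my cart", "search cart shoes add to cart", 1),
        ("turn on dark mode", "recipes favorites shopping list about", 0),
        ("log out of my account", "profile settings log out help", 1),
        ("book a table for two tonight", "alarm timer stopwatch world clock", 0),
    ]

    # Concatenate command and screen text into a single input string per example.
    texts = [f"{cmd} [SEP] {screen}" for cmd, screen, _ in train]
    labels = [y for _, _, y in train]

    # Bag-of-words features with a linear classifier over the joint text.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)

    # Predict feasibility for a new command in a given screen context.
    query = "enable dark mode [SEP] recipes favorites shopping list about"
    print("feasible" if model.predict([query])[0] == 1 else "infeasible")

In the dataset itself the app environment is multimodal (screenshots plus view hierarchies), so a realistic model would combine visual and structural features rather than screen text alone.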
Pages: 312-328
Number of pages: 17