Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation

Cited by: 0

Authors:

Hu, Ronghang [1]

Fried, Daniel [1]

Rohrbach, Anna [1]

Klein, Dan [1]

Darrell, Trevor [1]

Saenko, Kate [2]

Affiliations:

[1] Univ Calif Berkeley, Berkeley, CA 94720 USA

[2] Boston Univ, Boston, MA 02215 USA

Source:

57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019) | 2019

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Vision-and-Language Navigation (VLN) requires grounding instructions, such as "turn right and stop at the door", to routes in a visual environment. The actual grounding can connect language to the environment through multiple modalities, e.g., "stop at the door" might ground into visual objects, while "turn right" might rely only on the geometric structure of a route. We investigate where the natural language empirically grounds under two recent state-of-the-art VLN models. Surprisingly, we discover that visual features may actually hurt these models: models which use only route structure, ablating visual features, outperform their visual counterparts in unseen new environments on the benchmark Room-to-Room dataset. To better use all the available modalities, we propose to decompose the grounding procedure into a set of expert models with access to different modalities (including object detections) and ensemble them at prediction time, improving the performance of state-of-the-art models on the VLN task.
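The ensembling idea in the abstract can be illustrated with a small sketch. The abstract does not say how the expert predictions are combined, so the rule used here (a weighted average of per-expert log-probabilities over candidate actions) is an assumption, as are all function names, expert names, and scores. It is an illustration, not the authors' implementation.

```python
import math


def log_softmax(scores):
    """Convert raw scores into log-probabilities (numerically stable)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def ensemble_action_scores(expert_scores, weights=None):
    """Combine per-expert scores over the same list of candidate actions.

    expert_scores: dict mapping a (hypothetical) expert name to a list of
    raw scores, one per candidate action.
    weights: optional dict of per-expert weights; uniform if omitted.
    Assumption: experts are combined by a weighted average of their
    log-probabilities. The source abstract does not specify the rule.
    """
    names = list(expert_scores)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    n_actions = len(expert_scores[names[0]])
    combined = [0.0] * n_actions
    for name in names:
        log_probs = log_softmax(expert_scores[name])
        for i in range(n_actions):
            combined[i] += weights[name] * log_probs[i]
    return combined


def pick_action(expert_scores, weights=None):
    """Return the index of the candidate action with the best combined score."""
    combined = ensemble_action_scores(expert_scores, weights)
    return max(range(len(combined)), key=combined.__getitem__)


if __name__ == "__main__":
    # Made-up scores for three candidate actions from three hypothetical
    # experts, each with access to a different modality.
    scores = {
        "route_structure": [2.0, 0.5, 0.1],
        "visual": [0.2, 1.5, 0.3],
        "object_detections": [1.0, 1.2, 0.1],
    }
    print("chosen action index:", pick_action(scores))
```

Averaging in log space keeps any single expert from dominating through its raw score scale. The weights could be tuned on held-out data, which is again an assumption rather than something the abstract states.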

Pages: 6551-6557

Number of pages: 7
