WebVLN: Vision-and-Language Navigation on Websites

被引:0
|
作者
Chen, Qi [1 ]
Pitawela, Dileepa [1 ]
Zhao, Chongyang [1 ]
Zhou, Gengze [1 ]
Chen, Hsiang-Ting [1 ]
Wu, Qi [1 ]
机构
[1] Univ Adelaide, Australian Inst Machine Learning, Adelaide, SA, Australia
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Vision-and-Language Navigation (VLN) task aims to enable AI agents to accurately understand and follow natural language instructions to navigate through real-world environments, ultimately reaching specific target locations. We recognise a promising opportunity to extend VLN to a comparable navigation task that holds substantial significance in our daily lives, albeit within the virtual realm: navigating websites on the Internet. This paper proposes a new task named Vision-and-Language Navigation on Websites (WebVLN), where we use question-based instructions to train an agent, emulating how users naturally browse websites. Unlike the existing VLN task that only pays attention to vision and instruction (language), the WebVLN agent further considers underlying web-specific content like HTML, which could not be seen on the rendered web pages yet contains rich visual and textual information. Toward this goal, we contribute a dataset, WebVLN-v1, and introduce a novel approach called Website-aware VLN Network (WebVLN-Net), which is built upon the foundation of state-of-the-art VLN techniques. Experimental results show that WebVLN-Net outperforms current VLN and web-related navigation methods. We believe that the introduction of the new WebVLN task and its dataset will establish a new dimension within the VLN domain and contribute to the broader vision-and-language research community. Code is available at: https://github.com/WebVLN/WebVLN.
引用
收藏
页码:1165 / 1173
页数:9
相关论文
共 50 条
  • [21] Speaker-Follower Models for Vision-and-Language Navigation
    Fried, Daniel
    Hu, Ronghang
    Cirik, Volkan
    Rohrbach, Anna
    Andreas, Jacob
    Morency, Louis-Philippe
    Berg-Kirkpatrick, Taylor
    Saenko, Kate
    Klein, Dan
    Darrell, Trevor
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [22] Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation
    Jain, Vihan
    Magalhaes, Gabriel
    Ku, Alexander
    Vaswani, Ashish
    Ie, Eugene
    Baldridge, Jason
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 1862 - 1872
  • [23] Airbert: In-domain Pretraining for Vision-and-Language Navigation
    Guhur, Pierre-Louis
    Tapaswi, Makarand
    Chen, Shizhe
    Laptev, Ivan
    Schmid, Cordelia
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1614 - 1623
  • [24] DynamicVLN: Incorporating Dynamics into Vision-and-Language Navigation Scenarios
    Department of Electronics and Electrical Engineering, Faculty of Science and Technology, Keio University, 3-14-1, Hiyoshi, Kohoku-ku, Yokohama
    223-8522, Japan
    不详
    305-8560, Japan
    Sensors, 2025, 25 (02)
  • [25] GridMM: Grid Memory Map for Vision-and-Language Navigation
    Wang, Zihan
    Li, Xiangyang
    Yang, Jiahao
    Liu, Yeqi
    Jiang, Shuqiang
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15579 - 15590
  • [26] KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation
    Li, Xiangyang
    Wang, Zihan
    Yang, Jiahao
    Wang, Yaowei
    Jiang, Shuqiang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2583 - 2592
  • [27] Sub-Instruction Aware Vision-and-Language Navigation
    Hong, Yicong
    Rodriguez-Opazo, Cristian
    Wu, Qi
    Gould, Stephen
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 3360 - 3376
  • [28] Learning Vision-and-Language Navigation from YouTube Videos
    Lin, Kunyang
    Chen, Peihao
    Huang, Diwei
    Li, Thomas H.
    Tan, Mingkui
    Gan, Chuang
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 8283 - 8292
  • [29] NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
    Zhou, Gengze
    Hong, Yicong
    Wu, Qi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7641 - 7649
  • [30] Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation
    Hu, Ronghang
    Fried, Daniel
    Rohrbach, Anna
    Klein, Dan
    Darrell, Trevor
    Saenko, Kate
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 6551 - 6557