WebVLN: Vision-and-Language Navigation on Websites

Authors
Chen, Qi [1]
Pitawela, Dileepa [1]
Zhao, Chongyang [1]
Zhou, Gengze [1]
Chen, Hsiang-Ting [1]
Wu, Qi [1]
Affiliations
[1] Univ Adelaide, Australian Inst Machine Learning, Adelaide, SA, Australia
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The Vision-and-Language Navigation (VLN) task aims to enable AI agents to accurately understand and follow natural language instructions to navigate through real-world environments and ultimately reach specific target locations. We recognise a promising opportunity to extend VLN to a comparable navigation task that holds substantial significance in our daily lives, albeit within the virtual realm: navigating websites on the Internet. This paper proposes a new task named Vision-and-Language Navigation on Websites (WebVLN), where we use question-based instructions to train an agent, emulating how users naturally browse websites. Unlike the existing VLN task, which only pays attention to vision and instruction (language), the WebVLN agent further considers underlying web-specific content such as HTML, which cannot be seen on the rendered web pages yet contains rich visual and textual information. Toward this goal, we contribute a dataset, WebVLN-v1, and introduce a novel approach called Website-aware VLN Network (WebVLN-Net), which is built upon state-of-the-art VLN techniques. Experimental results show that WebVLN-Net outperforms current VLN and web-related navigation methods. We believe that the introduction of the new WebVLN task and its dataset will establish a new dimension within the VLN domain and contribute to the broader vision-and-language research community. Code is available at: https://github.com/WebVLN/WebVLN.
Pages: 1165-1173
Number of pages: 9