To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Cited by: 0
Authors
Su, Ke [1]
Zhang, Xingxing [1]
Zhang, Siyang [2]
Zhu, Jun [1, 3, 4]
Zhang, Bo [1]
Affiliations
[1] Tsinghua Univ, Inst AI, Tsinghua Bosch Joint ML Ctr, BNRist Ctr,Dept Comp Sci & Technol,THBI Lab, Beijing 100084, Peoples R China
[2] Nankai Univ, Sch Artificial Intelligence, Tianjin 300071, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Pazhou Lab Huangpu, Guangzhou 510700, Peoples R China
Keywords
Cognition; Visualization; Artificial intelligence; Training; Three-dimensional displays; Image reconstruction; Navigation; Embodied artificial intelligence; embodied reasoning; zero-shot generalization; vision-language pre-training;
DOI
10.1109/TIP.2024.3459800
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, research interest in embodied artificial intelligence (EAI) has grown. EAI involves an agent learning to perform a specific task while dynamically interacting with the surrounding 3D environment. A new challenge in this setting is that many unseen objects may appear, owing to the growing number of object categories in 3D scenes. This makes it necessary to develop models with strong zero-shot generalization to new objects. Existing work tries to achieve this goal by providing embodied agents with massive high-quality human annotations closely related to the task to be learned, which is too costly in practice. Inspired by recent advances in pre-trained models for 2D visual tasks, we attempt to boost zero-shot generalization for embodied reasoning with vision-language pre-training, which can encode common sense as general prior knowledge. To further improve performance on a specific task, we rectify the pre-trained representation through masked scene graph modeling (MSGM) in a self-supervised manner, where the task-specific knowledge is learned through iterative message passing. Our method improves a variety of representative embodied reasoning tasks by a large margin (e.g., over 5.0% in answer accuracy on the MP3D-EQA dataset, which consists of many real-world scenes with a large number of new objects during testing) and achieves new state-of-the-art performance.
Pages: 5370-5381
Page count: 12