To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Cited: 0
Authors
Su, Ke [1]
Zhang, Xingxing [1]
Zhang, Siyang [2]
Zhu, Jun [1,3,4]
Zhang, Bo [1]
Affiliations
[1] Tsinghua Univ, Inst AI, Tsinghua Bosch Joint ML Ctr, BNRist Ctr, Dept Comp Sci & Technol, THBI Lab, Beijing 100084, Peoples R China
[2] Nankai Univ, Sch Artificial Intelligence, Tianjin 300071, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Pazhou Lab Huangpu, Guangzhou 510700, Peoples R China
Keywords
Cognition; Visualization; Artificial intelligence; Training; Three-dimensional displays; Image reconstruction; Navigation; Embodied artificial intelligence; embodied reasoning; zero-shot generalization; vision-language pre-training
DOI
10.1109/TIP.2024.3459800
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, there has been increasing research interest in embodied artificial intelligence (EAI), in which an agent learns to perform a specific task while dynamically interacting with its surrounding 3D environment. A new challenge arises here: as the number of object categories in 3D scenes grows, many unseen objects may appear, making it necessary to develop models with strong zero-shot generalization to new objects. Existing work pursues this goal by providing embodied agents with massive, high-quality human annotations closely related to the task to be learned, which is too costly in practice. Inspired by recent advances in pre-trained models for 2D visual tasks, we attempt to boost zero-shot generalization for embodied reasoning with vision-language pre-training, which can encode common sense as general prior knowledge. To further improve performance on a specific task, we rectify the pre-trained representation through masked scene graph modeling (MSGM) in a self-supervised manner, where task-specific knowledge is learned via iterative message passing. Our method improves a variety of representative embodied reasoning tasks by a large margin (e.g., over 5.0% answer accuracy on the MP3D-EQA dataset, which consists of many real-world scenes containing a large number of new objects during testing), achieving new state-of-the-art performance.
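The core MSGM idea in the abstract (mask part of a scene graph, then reconstruct it from neighboring nodes via iterative message passing, using the reconstruction error as a self-supervised signal) can be illustrated with a toy sketch. Everything below (graph size, feature dimension, the 0.5 blend weight, mean-aggregation) is an illustrative assumption, not the paper's actual architecture.

```python
# Toy scene graph: 4 object nodes with 2-dim features and a
# symmetric adjacency matrix (which objects relate in the scene).
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
n, d = len(feats), len(feats[0])

# Mask node 2; its features must be recovered from neighbor messages.
masked = 2
x = [row[:] for row in feats]
x[masked] = [0.0] * d

# Iterative message passing: each step averages neighbor features
# and blends them with the node's current state.
for _ in range(3):
    nxt = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        msg = [sum(x[j][k] for j in nbrs) / len(nbrs) for k in range(d)]
        nxt.append([0.5 * x[i][k] + 0.5 * msg[k] for k in range(d)])
    x = nxt

# Self-supervised objective: reconstruction error on the masked node only.
loss = sum((x[masked][k] - feats[masked][k]) ** 2 for k in range(d)) / d
print(f"masked-node reconstruction loss: {loss:.4f}")
```

In the paper this role is played by a learned model whose parameters are trained to drive the masked-node reconstruction loss down; the sketch only shows the masking and message-passing mechanics.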
Pages: 5370-5381
Page count: 12