To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Cited: 0
Authors
Su, Ke [1 ]
Zhang, Xingxing [1 ]
Zhang, Siyang [2 ]
Zhu, Jun [1 ,3 ,4 ]
Zhang, Bo [1 ]
Affiliations
[1] Tsinghua Univ, Inst AI, Tsinghua Bosch Joint ML Ctr, BNRist Ctr,Dept Comp Sci & Technol,THBI Lab, Beijing 100084, Peoples R China
[2] Nankai Univ, Sch Artificial Intelligence, Tianjin 300071, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Pazhou Lab Huangpu, Guangzhou 510700, Peoples R China
Keywords
Cognition; Visualization; Artificial intelligence; Training; Three-dimensional displays; Image reconstruction; Navigation; Embodied artificial intelligence; embodied reasoning; zero-shot generalization; vision-language pre-training;
DOI
10.1109/TIP.2024.3459800
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Recently, there has been increased research interest in embodied artificial intelligence (EAI), in which an agent learns to perform a specific task while dynamically interacting with its surrounding 3D environment. A new challenge therein is that many unseen objects may appear, owing to the increased number of object categories in 3D scenes. This makes it necessary to develop models with strong zero-shot generalization to new objects. Existing work pursues this goal by providing embodied agents with massive high-quality human annotations closely related to the task to be learned, which is too costly in practice. Inspired by recent advances in pre-trained models for 2D visual tasks, we attempt to boost zero-shot generalization for embodied reasoning with vision-language pre-training, which encodes common sense as general prior knowledge. To further improve performance on a specific task, we rectify the pre-trained representation through masked scene graph modeling (MSGM) in a self-supervised manner, where task-specific knowledge is learned from iterative message passing. Our method improves a variety of representative embodied reasoning tasks by a large margin (e.g., over 5.0% answer accuracy on the MP3D-EQA dataset, which consists of many real-world scenes with a large number of new objects during testing) and achieves new state-of-the-art performance.
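The abstract describes MSGM only at a high level: mask part of the scene graph, run iterative message passing, and reconstruct the masked content self-supervisedly. The following is a minimal sketch of that idea in PyTorch; it is not the authors' implementation, and all names (MessagePassingLayer, MSGMSketch, mask_ratio, the GRU-based node update) are illustrative assumptions.

import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of message passing over a directed scene graph."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)  # message from a (sender, receiver) feature pair
        self.upd = nn.GRUCell(dim, dim)     # node update from aggregated messages

    def forward(self, x, edges):
        # x: (N, dim) node features; edges: (E, 2) directed (src, dst) index pairs
        src, dst = edges[:, 0], edges[:, 1]
        m = torch.relu(self.msg(torch.cat([x[src], x[dst]], dim=-1)))
        agg = torch.zeros_like(x).index_add_(0, dst, m)  # sum incoming messages per node
        return self.upd(agg, x)

class MSGMSketch(nn.Module):
    """Mask node features, propagate messages, reconstruct the masked nodes."""
    def __init__(self, dim=256, steps=3, mask_ratio=0.15):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.layers = nn.ModuleList(MessagePassingLayer(dim) for _ in range(steps))
        self.decoder = nn.Linear(dim, dim)
        self.mask_ratio = mask_ratio

    def forward(self, x, edges):
        target = x.detach()
        mask = torch.rand(x.size(0), device=x.device) < self.mask_ratio
        mask[torch.randint(0, x.size(0), (1,))] = True  # guarantee >= 1 masked node
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        for layer in self.layers:  # iterative message passing
            x = layer(x, edges)
        recon = self.decoder(x)
        return ((recon - target)[mask] ** 2).mean()  # loss only on masked nodes

# Toy usage: 6 object nodes (e.g., features from a vision-language encoder)
# connected by 5 directed relations.
feats = torch.randn(6, 256)
edges = torch.tensor([[0, 1], [1, 2], [2, 0], [3, 4], [4, 5]])
print(MSGMSketch()(feats, edges))  # scalar self-supervised reconstruction loss

Computing the loss only on masked nodes mirrors masked-modeling objectives such as BERT's; the paper's actual graph construction, masking scheme, and loss design may differ.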
Pages: 5370-5381
Page Count: 12