To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Cited: 0
Authors
Su, Ke [1 ]
Zhang, Xingxing [1 ]
Zhang, Siyang [2 ]
Zhu, Jun [1 ,3 ,4 ]
Zhang, Bo [1 ]
Affiliations
[1] Tsinghua Univ, Inst AI, Tsinghua Bosch Joint ML Ctr, BNRist Ctr,Dept Comp Sci & Technol,THBI Lab, Beijing 100084, Peoples R China
[2] Nankai Univ, Sch Artificial Intelligence, Tianjin 300071, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Pazhou Lab Huangpu, Guangzhou 510700, Peoples R China
Keywords
Cognition; Visualization; Artificial intelligence; Training; Three-dimensional displays; Image reconstruction; Navigation; Embodied artificial intelligence; embodied reasoning; zero-shot generalization; vision-language pre-training;
DOI
10.1109/TIP.2024.3459800
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Recently, there has been increased research interest in embodied artificial intelligence (EAI), in which an agent learns to perform a specific task while dynamically interacting with its surrounding 3D environment. A new challenge therein is that many unseen objects may appear, owing to the increased number of object categories in 3D scenes. This makes it necessary to develop models with strong zero-shot generalization to new objects. Existing work pursues this goal by providing embodied agents with massive high-quality human annotations closely related to the task to be learned, which is too costly in practice. Inspired by recent advances in pre-trained models for 2D visual tasks, we attempt to boost zero-shot generalization for embodied reasoning with vision-language pre-training, which encodes common sense as general prior knowledge. To further improve performance on a specific task, we rectify the pre-trained representation through masked scene graph modeling (MSGM) in a self-supervised manner, where task-specific knowledge is learned from iterative message passing. Our method improves a variety of representative embodied reasoning tasks by a large margin (e.g., over 5.0% answer accuracy on the MP3D-EQA dataset, which consists of many real-world scenes with a large number of new objects during testing) and achieves new state-of-the-art performance.
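The abstract describes MSGM only at a high level: mask part of the scene graph, run iterative message passing, and reconstruct the masked content self-supervisedly. The following is a minimal sketch of that idea in PyTorch; it is not the authors' implementation, and all names (MessagePassingLayer, MSGMSketch, mask_ratio, the GRU-based node update) are illustrative assumptions.

import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of message passing over a directed scene graph."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)  # message from a (sender, receiver) feature pair
        self.upd = nn.GRUCell(dim, dim)     # node update from aggregated messages

    def forward(self, x, edges):
        # x: (N, dim) node features; edges: (E, 2) directed (src, dst) index pairs
        src, dst = edges[:, 0], edges[:, 1]
        m = torch.relu(self.msg(torch.cat([x[src], x[dst]], dim=-1)))
        agg = torch.zeros_like(x).index_add_(0, dst, m)  # sum incoming messages per node
        return self.upd(agg, x)

class MSGMSketch(nn.Module):
    """Mask node features, propagate messages, reconstruct the masked nodes."""
    def __init__(self, dim=256, steps=3, mask_ratio=0.15):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.layers = nn.ModuleList(MessagePassingLayer(dim) for _ in range(steps))
        self.decoder = nn.Linear(dim, dim)
        self.mask_ratio = mask_ratio

    def forward(self, x, edges):
        target = x.detach()
        mask = torch.rand(x.size(0), device=x.device) < self.mask_ratio
        mask[torch.randint(0, x.size(0), (1,))] = True  # guarantee >= 1 masked node
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        for layer in self.layers:  # iterative message passing
            x = layer(x, edges)
        recon = self.decoder(x)
        return ((recon - target)[mask] ** 2).mean()  # loss only on masked nodes

# Toy usage: 6 object nodes (e.g., features from a vision-language encoder)
# connected by 5 directed relations.
feats = torch.randn(6, 256)
edges = torch.tensor([[0, 1], [1, 2], [2, 0], [3, 4], [4, 5]])
print(MSGMSketch()(feats, edges))  # scalar self-supervised reconstruction loss

Computing the loss only on masked nodes mirrors masked-modeling objectives such as BERT's; the paper's actual graph construction, masking scheme, and loss design may differ.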
Pages: 5370-5381
Page Count: 12