SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

Cited by: 1
Authors
Zou, Bo [1 ]
Yang, Chao [2 ]
Quan, Chengbin [1 ]
Zhao, Youjian [1 ,3 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Shanghai AI Lab, Shanghai, Peoples R China
[3] Zhongguancun Lab, Beijing, Peoples R China
Funding
National Key R&D Program of China; Beijing Natural Science Foundation;
Keywords
vision-language pretraining; contrastive learning; video retrieval; video question answering;
DOI
10.1145/3581783.3612379
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The tremendous progress in vision-to-language retrieval in recent years has been fueled by contrastive vision-language pretraining (VLP), such as CLIP. However, contrastive methods do not exhibit the same level of performance on other downstream tasks (e.g., video question answering and natural language grounding). One possible reason is that they ignore the misalignment between vision and language, especially the absence of spatial information in language. To mitigate this issue, we start from a new perspective and propose a contrastive VLP framework with spatial reconstruction on text (SpaceCLIP). Specifically, we introduce a unique reconstruction method that assigns text representations into the same spatial structure as images or videos, together with a pretraining objective, SpatialNCE, to reduce the computational overhead and ensure performance on downstream tasks. Empirically, we show that SpaceCLIP outperforms other methods, with performance gains ranging from 2.1% to 9.0% on MSRVTT and EgoCLIP multiple-choice question answering, 2.5% to 11.0% on EPIC-KITCHENS-100 and MSRVTT multi-instance retrieval, and 0.31% to 7.2% on the Ego4D natural language query benchmark.
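For orientation, the sketch below shows the symmetric InfoNCE objective that underlies CLIP-style contrastive VLP referenced in the abstract. It is a minimal illustration only: the paper's SpatialNCE objective and its spatial text-reconstruction step are not specified in this record and are not reproduced here, and names such as image_emb, text_emb, and temperature are illustrative assumptions.

import torch
import torch.nn.functional as F

def clip_style_infonce(image_emb: torch.Tensor,
                       text_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    # image_emb, text_emb: (batch, dim) L2-normalized embeddings of paired samples.
    # Pairwise cosine similarities, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    # The i-th image matches the i-th text, so the targets are the diagonal indices.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)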
Pages: 519-528
Page count: 10