SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

Cited by: 1
Authors
Zou, Bo [1 ]
Yang, Chao [2 ]
Quan, Chengbin [1 ]
Zhao, Youjian [1 ,3 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] Shanghai AI Lab, Shanghai, Peoples R China
[3] Zhongguancun Lab, Beijing, Peoples R China
Funding
National Key R&D Program of China; Beijing Natural Science Foundation;
Keywords
vision-language pretraining; contrastive learning; video retrieval; video question answering;
DOI
10.1145/3581783.3612379
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The tremendous progress in vision-to-language retrieval in recent years has been fueled by contrastive vision-language pretraining (VLP), such as CLIP. However, contrastive methods do not exhibit the same level of performance on other downstream tasks (e.g., video question answering and natural language grounding). One possible reason is that they ignore the misalignment between vision and language, especially the absence of spatial information in language. To mitigate this issue, we start from a new perspective and propose a contrastive VLP framework with spatial reconstruction on text (SpaceCLIP). Specifically, we introduce a unique reconstruction method that assigns text representations into the same spatial structure as images or videos, together with a pretraining objective, SpatialNCE, to reduce the computational overhead and ensure performance on downstream tasks. Empirically, we show that SpaceCLIP outperforms other methods, with performance gains ranging from 2.1% to 9.0% on MSRVTT and EgoCLIP multiple-choice question answering, 2.5% to 11.0% on EPIC-KITCHENS-100 and MSRVTT multi-instance retrieval, and 0.31% to 7.2% on the Ego4D natural language query benchmark.
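For orientation, the sketch below shows the symmetric InfoNCE objective that underlies CLIP-style contrastive VLP referenced in the abstract. It is a minimal illustration only: the paper's SpatialNCE objective and its spatial text-reconstruction step are not specified in this record and are not reproduced here, and names such as image_emb, text_emb, and temperature are illustrative assumptions.

import torch
import torch.nn.functional as F

def clip_style_infonce(image_emb: torch.Tensor,
                       text_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    # image_emb, text_emb: (batch, dim) L2-normalized embeddings of paired samples.
    # Pairwise cosine similarities, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    # The i-th image matches the i-th text, so the targets are the diagonal indices.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)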
Pages: 519-528
Page count: 10