RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Cited by: 1
Authors
Yuan, Hangjie [1 ]
Zhang, Shiwei [2 ]
Wang, Xiang [3 ]
Albanie, Samuel [4 ]
Pan, Yining [5 ]
Feng, Tao [2 ]
Jiang, Jianwen [2 ]
Ni, Dong [1 ]
Zhang, Yingya [2 ]
Zhao, Deli [2 ]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[4] Univ Cambridge, CAML Lab, Cambridge, England
[5] Singapore Univ Technol & Design, Singapore, Singapore
Funding
National Natural Science Foundation of China
DOI
10.1109/ICCV51070.2023.01979
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of the RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast-converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the pre-training and fine-tuning time. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuned, few-shot, and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1% of the data, and 45.09 mAP with 100% of the data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.
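The core idea behind the Relation Tagger described in the abstract — assigning caption-derived relation texts to region pairs — can be illustrated with a minimal embedding-matching sketch. This is a hypothetical simplification, not the paper's actual architecture: the function name `tag_relations`, the similarity threshold, and the greedy argmax assignment are all illustrative assumptions; the real Relation Tagger operates on cross-modal features inside the model.

```python
import numpy as np

def tag_relations(pair_embeddings, text_embeddings, threshold=0.5):
    """Hypothetical sketch: assign each caption-derived relation text to
    the region pair whose embedding matches it best.

    pair_embeddings: (num_pairs, d) features for candidate region pairs.
    text_embeddings: (num_texts, d) features for relation texts (e.g.,
                     phrases parsed from BLIP-generated captions).
    Returns a dict mapping text index -> best-matching pair index,
    keeping only matches above the cosine-similarity threshold.
    """
    # L2-normalize so that dot products are cosine similarities.
    p = pair_embeddings / np.linalg.norm(pair_embeddings, axis=1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    sim = t @ p.T  # (num_texts, num_pairs) similarity matrix

    assignments = {}
    for i, row in enumerate(sim):
        j = int(np.argmax(row))  # greedy: best pair for this text
        if row[j] >= threshold:  # discard weak, likely-spurious matches
            assignments[i] = j
    return assignments
```

The threshold acts as a pseudo-label quality filter: relation texts that match no region pair confidently are dropped rather than injected as noisy supervision, which is the general trade-off any pseudo-labelling pipeline of this kind must make.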
Pages: 21592-21604
Page count: 13