RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Cited by: 1
Authors:
Yuan, Hangjie [1 ]
Zhang, Shiwei [2 ]
Wang, Xiang [3 ]
Albanie, Samuel [4 ]
Pan, Yining [5 ]
Feng, Tao [2 ]
Jiang, Jianwen [2 ]
Ni, Dong [1 ]
Zhang, Yingya [2 ]
Zhao, Deli [2 ]
Affiliations:
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[4] Univ Cambridge, CAML Lab, Cambridge, England
[5] Singapore Univ Technol & Design, Singapore, Singapore
Funding:
National Natural Science Foundation of China
DOI: 10.1109/ICCV51070.2023.01979
Chinese Library Classification: TP18 [Theory of Artificial Intelligence]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing relational reasoning in computer vision tasks. However, scaling RLIPv1 is challenging: its architecture converges slowly, and existing scene graph data are limited. In this paper, we propose RLIPv2, a fast-converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF achieves comparable or better performance than RLIPv1 in a fraction of the pre-training and fine-tuning time. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a purpose-designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, enabling larger-scale relational pre-training. Through extensive experiments on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuned, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1% of the data, and 45.09 mAP with 100% of the data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.
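The abstract describes ALIF as sparsified language encoding layers with earlier, deeper, gated cross-modal fusion. The code below is a minimal sketch of that idea under our own assumptions: the zero-initialised tanh gate, the layer count, and all dimensions are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedCrossFusionLayer(nn.Module):
    """One fusion layer: text tokens attend to image features through a
    cross-attention whose contribution starts at zero (gate initialised
    to 0), so fusion is introduced gradually during pre-training."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # learned scalar gate
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text, image):
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]
        t = self.norm2(text)
        # tanh-gated cross-attention: image features are mixed in softly
        text = text + torch.tanh(self.gate) * self.cross_attn(t, image, image)[0]
        return text

class SparseLanguageEncoder(nn.Module):
    """Asymmetric design: far fewer (sparsified) language layers than
    vision layers, each fusing with the image stream from the start."""
    def __init__(self, num_layers: int = 3, dim: int = 256):
        super().__init__()
        self.layers = nn.ModuleList(
            GatedCrossFusionLayer(dim) for _ in range(num_layers))

    def forward(self, text, image):  # text: (B, Lt, dim), image: (B, Li, dim)
        for layer in self.layers:
            text = layer(text, image)
        return text
```

Initialising the gate at zero means each fusion layer starts out behaving like a plain language layer, which is one common way to keep early training stable when cross-modal terms are added.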
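The captioner-plus-Relation-Tagger pipeline can likewise be pictured as: caption the image, parse relation phrases from the caption, then score detected region pairs against each phrase and keep confident matches as pseudo labels. The sketch below is hypothetical; every callable it takes is a placeholder for illustration, not the authors' actual interface.

```python
from itertools import permutations

def pseudo_label(image, detect, caption, parse_relations, tag_score,
                 threshold=0.5):
    """Turn an object-detection image into pseudo scene-graph triplets.

    detect(image)          -> list of region boxes (from a detection dataset)
    caption(image)         -> free-form text, e.g. a BLIP-generated caption
    parse_relations(text)  -> [(subject, predicate, object), ...] phrases
    tag_score(...)         -> confidence that a (subject box, object box)
                              pair expresses the given relation phrase
    """
    boxes = detect(image)
    text = caption(image)
    triplets = []
    for subj, pred, obj in parse_relations(text):
        # score every ordered region pair against the relation phrase and
        # keep confident matches as pseudo annotations for pre-training
        for b_s, b_o in permutations(boxes, 2):
            if tag_score(b_s, b_o, subj, pred, obj) > threshold:
                triplets.append((b_s, pred, b_o))
    return triplets
```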
Pages: 21592-21604
Page count: 13
Related Papers (50 total)
  • [1] RLIPv2: Fast Scaling of Relational Language-Image Pre-training
    Yuan, Hangjie
    Zhang, Shiwei
    Wang, Xiang
    Albanie, Samuel
    Pan, Yining
    Feng, Tao
    Jiang, Jianwen
    Ni, Dong
    Zhang, Yingya
    Zhao, Deli
    [J]. Proceedings of the IEEE International Conference on Computer Vision, 2023: 21592-21604
  • [2] Scaling Language-Image Pre-training via Masking
    Li, Yanghao
    Fan, Haoqi
    Hu, Ronghang
    Feichtenhofer, Christoph
    He, Kaiming
    [J]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 23390-23400
  • [3] Grounded Language-Image Pre-training
    Li, Liunian Harold
    Zhang, Pengchuan
    Zhang, Haotian
    Yang, Jianwei
    Li, Chunyuan
    Zhong, Yiwu
    Wang, Lijuan
    Yuan, Lu
    Zhang, Lei
    Hwang, Jenq-Neng
    Chang, Kai-Wei
    Gao, Jianfeng
    [J]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022: 10955-10965
  • [4] RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
    Yuan, Hangjie
    Jiang, Jianwen
    Albanie, Samuel
    Feng, Tao
    Huang, Ziyuan
    Ni, Dong
    Tang, Mingqian
    [J]. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
  • [5] Contrastive Language-Image Pre-Training with Knowledge Graphs
    Pan, Xuran
    Ye, Tianzhu
    Han, Dongchen
    Song, Shiji
    Huang, Gao
    [J]. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
  • [6] ALIP: Adaptive Language-Image Pre-training with Synthetic Caption
    Yang, Kaicheng
    Deng, Jiankang
    An, Xiang
    Li, Jiawei
    Feng, Ziyong
    Guo, Jia
    Yang, Jing
    Liu, Tongliang
    [J]. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023: 2910-2919
  • [7] NLIP: Noise-Robust Language-Image Pre-training
    Huang, Runhui
    Long, Yanxin
    Han, Jianhua
    Xu, Hang
    Liang, Xiwen
    Xu, Chunjing
    Liang, Xiaodan
    [J]. Thirty-Seventh AAAI Conference on Artificial Intelligence, Vol 37, No 1, 2023: 926-934
  • [8] UniCLIP: Unified Framework for Contrastive Language-Image Pre-training
    Lee, Janghyeon
    Kim, Jongsuk
    Shon, Hyounguk
    Kim, Bumsoo
    Kim, Seung Hwan
    Lee, Honglak
    Kim, Junmo
    [J]. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
  • [9] Non-Contrastive Learning Meets Language-Image Pre-Training
    Zhou, Jinghao
    Dong, Li
    Gan, Zhe
    Wang, Lijuan
    Wei, Furu
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 11028 - 11038