Dynamic Contrastive Distillation for Image-Text Retrieval

Cited by: 8
Authors
Rao, Jun [1 ]
Ding, Liang [4 ]
Qi, Shuhan [2 ,3 ]
Fang, Meng [5 ]
Liu, Yang [1 ]
Shen, Li [4 ]
Tao, Dacheng [4 ]
Affiliations
[1] Harbin Inst Technol, Shenzhen 518055, Peoples R China
[2] Harbin Inst Technol Shenzhen, Peng Cheng Lab, Shenzhen 518055, Peoples R China
[3] Guangdong Prov Key Lab Novel Secur Intelligence Te, Shenzhen 518055, Peoples R China
[4] JD Explore Acad JD Com, Beijing 101111, Peoples R China
[5] Univ Liverpool, Liverpool L69 3BX, England
Keywords
Cross-modal retrieval; neural networks; contrastive learning; robust
DOI
10.1109/TMM.2023.3236837
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Recent advances in vision-and-language pretraining (VLP) have significantly improved the performance of cross-modal image-text retrieval (ITR) systems. However, the increasing size of VLP models makes real-world deployment difficult: their high inference latency renders them unsuitable for practical search scenarios. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework that compresses large VLP models for the ITR task. Technically, we face two challenges: 1) typical uni-modal metric learning is difficult to apply directly to cross-modal tasks, because computing cross-modal fusion features for a large number of negative samples exceeds the available GPU memory; 2) statically optimizing the student network on hard samples of fixed difficulty is inefficient and hampers both distillation and student optimization. We propose a multi-modal contrastive learning method that balances training cost against effectiveness: a teacher network identifies hard samples for the student to learn from, allowing the student to leverage the pre-trained teacher's knowledge while learning effectively from hard samples. To exploit hard sample pairs, we further propose dynamic distillation, which adaptively learns from samples of different difficulties so as to better balance the difficulty of the transferred knowledge against the student's own learning ability. We successfully apply the proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. We further provide in-depth analyses and discussions that explain how the performance improves.
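
The abstract describes two mechanisms: teacher-guided hard-negative mining, so the student's contrastive loss only touches a small, informative subset of negatives, and a dynamic, difficulty-dependent weighting of the distillation signal. The following is a minimal illustrative sketch in PyTorch, assuming simple dual-encoder embeddings; the function names, the top-k mining rule, and the disagreement-based weighting are assumptions made for illustration and are not taken from the paper or its released code.

# Sketch only: NOT the authors' implementation of DCD.
import torch
import torch.nn.functional as F


def mine_hard_negatives(teacher_img, teacher_txt, k: int = 8):
    """Use teacher embeddings to pick, for each image, the k most confusing
    non-matching captions in the batch (and the symmetric set for captions)."""
    sim = teacher_img @ teacher_txt.t()                        # [B, B] teacher similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float("-inf"))                 # exclude the positive pair
    hard_txt_idx = sim.topk(k, dim=1).indices                  # hardest captions per image
    hard_img_idx = sim.topk(k, dim=0).indices.t()              # hardest images per caption
    return hard_txt_idx, hard_img_idx


def dcd_style_loss(stu_img, stu_txt, tea_img, tea_txt, k: int = 8, tau: float = 0.05):
    """Contrastive loss over teacher-mined hard negatives plus a distillation term
    whose weight grows with the teacher/student disagreement on each sample (a
    stand-in for the paper's dynamic difficulty weighting)."""
    B = stu_img.size(0)
    hard_txt_idx, _ = mine_hard_negatives(tea_img, tea_txt, k)  # a symmetric
    # text-to-image term would use hard_img_idx; omitted here for brevity.

    # Image-to-text: positive caption (index 0) + k teacher-selected hard captions.
    pos = (stu_img * stu_txt).sum(-1, keepdim=True)                    # [B, 1]
    neg = torch.einsum("bd,bkd->bk", stu_img, stu_txt[hard_txt_idx])   # [B, k]
    logits = torch.cat([pos, neg], dim=1) / tau
    targets = torch.zeros(B, dtype=torch.long, device=stu_img.device)
    contrast = F.cross_entropy(logits, targets)

    # Distillation: match student and teacher similarity distributions on the same pairs.
    with torch.no_grad():
        tea_pos = (tea_img * tea_txt).sum(-1, keepdim=True)
        tea_neg = torch.einsum("bd,bkd->bk", tea_img, tea_txt[hard_txt_idx])
        tea_logits = torch.cat([tea_pos, tea_neg], dim=1) / tau
    kd = F.kl_div(F.log_softmax(logits, -1), F.softmax(tea_logits, -1),
                  reduction="none").sum(-1)                            # per-sample KD loss

    # "Dynamic" weight: samples where student and teacher disagree more (harder for
    # the student) receive a larger share of the distillation signal.
    weight = kd.detach() / (kd.detach().mean() + 1e-8)
    return contrast + (weight * kd).mean()


if __name__ == "__main__":
    B, D = 16, 256
    stu_img = F.normalize(torch.randn(B, D), dim=-1)
    stu_txt = F.normalize(torch.randn(B, D), dim=-1)
    tea_img = F.normalize(torch.randn(B, D), dim=-1)
    tea_txt = F.normalize(torch.randn(B, D), dim=-1)
    print(dcd_style_loss(stu_img, stu_txt, tea_img, tea_txt).item())

In the paper's setting the teacher would be a large VLP model such as METER and the student a smaller ViLT-style model, with the weighting following the proposed dynamic schedule rather than the simple disagreement-based stand-in above.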
Pages: 8383-8395
Number of pages: 13
Related Papers
50 records in total
  • [1] Image-Text Cross-Modal Retrieval with Instance Contrastive Embedding
    Zeng, Ruigeng
    Ma, Wentao
    Wu, Xiaoqian
    Liu, Wei
    Liu, Jie
    [J]. ELECTRONICS, 2024, 13 (02)
  • [2] Dynamic Modality Interaction Modeling for Image-Text Retrieval
    Qu, Leigang
    Liu, Meng
    Wu, Jianlong
    Gao, Zan
    Nie, Liqiang
    [J]. SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1104 - 1113
  • [3] External Knowledge Dynamic Modeling for Image-text Retrieval
    Yang, Song
    Li, Qiang
    Li, Wenhui
    Liu, Min
    Li, Xuanya
    Liu, Anan
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5330 - 5338
  • [4] Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval
    Haoyu Lu
    Yuqi Huo
    Mingyu Ding
    Nanyi Fei
    Zhiwu Lu
    [J]. Machine Intelligence Research, 2023, 20 : 569 - 582
  • [5] Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval
    Lu, Haoyu
    Huo, Yuqi
    Ding, Mingyu
    Fei, Nanyi
    Lu, Zhiwu
    [J]. MACHINE INTELLIGENCE RESEARCH, 2023, 20 (04) : 569 - 582
  • [6] CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval
    Wang, Haoran
    He, Dongliang
    Wu, Wenhao
    Xia, Boyang
    Yang, Min
    Li, Fu
    Yu, Yunlong
    Ji, Zhong
    Ding, Errui
    Wang, Jingdong
    [J]. COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 700 - 716
  • [7] Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training
    Liu, Chong
    Zhang, Yuqi
    Wang, Hongsong
    Chen, Weihua
    Wang, Fan
    Huang, Yan
    Shen, Yi-Dong
    Wang, Liang
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 3622 - 3633
  • [8] Masking-Based Cross-Modal Remote Sensing Image-Text Retrieval via Dynamic Contrastive Learning
    Zhao, Zuopeng
    Miao, Xiaoran
    He, Chen
    Hu, Jianfeng
    Min, Bingbing
    Gao, Yumeng
    Liu, Ying
    Pharksuwan, Kanyaphakphachsorn
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 15
  • [9] Compositional Learning of Image-Text Query for Image Retrieval
    Anwaar, Muhammad Umer
    Labintcev, Egor
    Kleinsteuber, Martin
    [J]. 2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2021), 2021, : 1139 - 1148
  • [10] CAliC: Accurate and Efficient Image-Text Retrieval via Contrastive Alignment and Visual Contexts Modeling
    Gao, Hongyu
    Zhu, Chao
    Liu, Mengyin
    Gu, Weibo
    Wang, Hongfa
    Liu, Wei
    Yin, Xu-Cheng
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4957 - 4966