Unified Semantic Enhancement With Momentum Contrast for Image-Text Retrieval

Cited by: 4
Authors
Zhang, Yan [1 ]
Ji, Zhong [1 ,2 ]
Wang, Di [1 ]
Pang, Yanwei [1 ,2 ]
Li, Xuelong [3 ,4 ]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin Key Lab Brain Inspired Intelligence Techn, Tianjin 300072, Peoples R China
[2] Shanghai Artificial Intelligence Lab, Shanghai 200232, Peoples R China
[3] Northwestern Polytech Univ, Minist Ind & Informat Technol, Key Lab Intelligent Interact & Applicat, Xian 710072, Peoples R China
[4] Northwestern Polytech Univ, Sch Artificial Intelligence Opt & Elect iOPEN, Xian 710072, Peoples R China
Keywords
Image-text retrieval; semantic enhancement; momentum contrast; dynamic queue; transformer
DOI
10.1109/TIP.2023.3348297
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
As a fundamental and challenging task in bridging the language and vision domains, Image-Text Retrieval (ITR) aims to search for the target instances in one modality that are semantically relevant to a query from the other modality, and its key challenge is measuring semantic similarity across modalities. Although significant progress has been achieved, existing approaches typically suffer from two major limitations: (1) they directly exploit bottom-up-attention-based region-level features in which every region is treated equally, which hurts the accuracy of the representation; (2) their mini-batch based end-to-end training mechanism limits the scale of negative sample pairs. To address these limitations, we propose a Unified Semantic Enhancement Momentum Contrastive Learning (USER) method for ITR. Specifically, we design two simple but effective Global representation based Semantic Enhancement (GSE) modules. One learns the global representation via a self-attention algorithm, denoted the Self-Guided Enhancement (SGE) module. The other benefits from the pre-trained CLIP model, providing a novel scheme to exploit and transfer knowledge from an off-the-shelf model, denoted the CLIP-Guided Enhancement (CGE) module. Moreover, we incorporate the training mechanism of Momentum Contrast (MoCo) into ITR, in which two dynamic queues enrich and enlarge the pool of negative sample pairs. Meanwhile, a Unified Training Objective (UTO) is developed to learn jointly from mini-batch based and dynamic-queue based samples. Extensive experiments on the benchmark MSCOCO and Flickr30K datasets demonstrate the superiority of USER in both retrieval accuracy and inference efficiency. For instance, compared with the existing best method NAAF, the R@1 metric of USER on the MSCOCO 5K test set is improved by 5% on caption retrieval and 2.4% on image retrieval without any external knowledge or pre-trained model, while enjoying over 60 times faster inference.
Our source code will be released at https://github.com/zhangy0822/USER.
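The abstract's core training mechanism (a MoCo-style momentum-updated encoder plus fixed-size dynamic queues of negative embeddings, scored with an InfoNCE-style contrastive objective) can be sketched as follows. This is an illustrative numpy sketch, not the authors' released implementation; the names `DynamicQueue`, `momentum_update`, and `info_nce`, and all hyperparameter values, are assumptions for illustration only.

```python
import numpy as np

def momentum_update(q_params, k_params, m=0.999):
    """EMA update of the key (momentum) encoder parameters, as in MoCo."""
    return [m * k + (1.0 - m) * q for q, k in zip(q_params, k_params)]

class DynamicQueue:
    """Fixed-size FIFO of negative embeddings (USER keeps one per modality)."""
    def __init__(self, dim, size):
        self.buf = np.zeros((size, dim), dtype=np.float32)
        self.ptr = 0
        self.full = False

    def enqueue(self, batch):
        # batch: (B, dim) key-encoder outputs; oldest entries are overwritten.
        b = batch.shape[0]
        idx = (self.ptr + np.arange(b)) % self.buf.shape[0]
        self.buf[idx] = batch
        if self.ptr + b >= self.buf.shape[0]:
            self.full = True
        self.ptr = (self.ptr + b) % self.buf.shape[0]

    def negatives(self):
        return self.buf if self.full else self.buf[: self.ptr]

def info_nce(query, pos_key, negs, tau=0.07):
    """InfoNCE loss for one L2-normalized query against its positive and queued negatives."""
    logits = np.concatenate(([query @ pos_key], negs @ query)) / tau
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return -np.log(p[0])                # positive sits at index 0
```

In this reading, the queue decouples the number of negatives from the mini-batch size: each step contrasts a query against the whole queue, then enqueues the current batch's momentum-encoder keys.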
Pages: 595-609 (15 pages)