Unified Semantic Enhancement With Momentum Contrast for Image-Text Retrieval

Cited by: 4
Authors
Zhang, Yan [1 ]
Ji, Zhong [1 ,2 ]
Wang, Di [1 ]
Pang, Yanwei [1 ,2 ]
Li, Xuelong [3 ,4 ]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin Key Lab Brain Inspired Intelligence Techn, Tianjin 300072, Peoples R China
[2] Shanghai Artificial Intelligence Lab, Shanghai 200232, Peoples R China
[3] Northwestern Polytech Univ, Minist Ind & Informat Technol, Key Lab Intelligent Interact & Applicat, Xian 710072, Peoples R China
[4] Northwestern Polytech Univ, Sch Artificial Intelligence Opt & Elect iOPEN, Xian 710072, Peoples R China
Keywords
Image-text retrieval; semantic enhancement; momentum contrast; dynamic queue; transformer
DOI
10.1109/TIP.2023.3348297
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
As a fundamental and challenging task in bridging the language and vision domains, Image-Text Retrieval (ITR) aims to search for the target instances in one modality that are semantically relevant to a query from the other modality, and its key challenge is measuring semantic similarity across modalities. Although significant progress has been achieved, existing approaches typically suffer from two major limitations: (1) they directly exploit bottom-up-attention-based region-level features in which every region is treated equally, which hurts the accuracy of the representation; (2) their mini-batch based end-to-end training mechanism limits the scale of negative sample pairs. To address these limitations, we propose a Unified Semantic Enhancement Momentum Contrastive Learning (USER) method for ITR. Specifically, we design two simple but effective Global representation based Semantic Enhancement (GSE) modules. One learns the global representation via a self-attention algorithm, denoted the Self-Guided Enhancement (SGE) module. The other benefits from the pre-trained CLIP model, providing a novel scheme to exploit and transfer knowledge from an off-the-shelf model, denoted the CLIP-Guided Enhancement (CGE) module. Moreover, we incorporate the training mechanism of Momentum Contrast (MoCo) into ITR, in which two dynamic queues enrich and enlarge the pool of negative sample pairs. Meanwhile, a Unified Training Objective (UTO) is developed to learn jointly from mini-batch based and dynamic-queue based samples. Extensive experiments on the benchmark MSCOCO and Flickr30K datasets demonstrate the superiority of USER in both retrieval accuracy and inference efficiency. For instance, compared with the existing best method NAAF, the R@1 metric of USER on the MSCOCO 5K test set is improved by 5% on caption retrieval and 2.4% on image retrieval without any external knowledge or pre-trained model, while enjoying over 60 times faster inference.
Our source code will be released at https://github.com/zhangy0822/USER.
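The abstract's core training mechanism (a MoCo-style momentum-updated encoder plus fixed-size dynamic queues of negative embeddings, scored with an InfoNCE-style contrastive objective) can be sketched as follows. This is an illustrative numpy sketch, not the authors' released implementation; the names `DynamicQueue`, `momentum_update`, and `info_nce`, and all hyperparameter values, are assumptions for illustration only.

```python
import numpy as np

def momentum_update(q_params, k_params, m=0.999):
    """EMA update of the key (momentum) encoder parameters, as in MoCo."""
    return [m * k + (1.0 - m) * q for q, k in zip(q_params, k_params)]

class DynamicQueue:
    """Fixed-size FIFO of negative embeddings (USER keeps one per modality)."""
    def __init__(self, dim, size):
        self.buf = np.zeros((size, dim), dtype=np.float32)
        self.ptr = 0
        self.full = False

    def enqueue(self, batch):
        # batch: (B, dim) key-encoder outputs; oldest entries are overwritten.
        b = batch.shape[0]
        idx = (self.ptr + np.arange(b)) % self.buf.shape[0]
        self.buf[idx] = batch
        if self.ptr + b >= self.buf.shape[0]:
            self.full = True
        self.ptr = (self.ptr + b) % self.buf.shape[0]

    def negatives(self):
        return self.buf if self.full else self.buf[: self.ptr]

def info_nce(query, pos_key, negs, tau=0.07):
    """InfoNCE loss for one L2-normalized query against its positive and queued negatives."""
    logits = np.concatenate(([query @ pos_key], negs @ query)) / tau
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return -np.log(p[0])                # positive sits at index 0
```

In this reading, the queue decouples the number of negatives from the mini-batch size: each step contrasts a query against the whole queue, then enqueues the current batch's momentum-encoder keys.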
Pages: 595-609 (15 pages)