USER: Unified Semantic Enhancement With Momentum Contrast for Image-Text Retrieval

Cited by: 4

Authors:

Zhang, Yan [1]

Ji, Zhong [1,2]

Wang, Di [1]

Pang, Yanwei [1,2]

Li, Xuelong [3,4]

Affiliations:

[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin Key Lab Brain Inspired Intelligence Techn, Tianjin 300072, Peoples R China

[2] Shanghai Artificial Intelligence Lab, Shanghai 200232, Peoples R China

[3] Northwestern Polytech Univ, Minist Ind & Informat Technol, Key Lab Intelligent Interact & Applicat, Xian 710072, Peoples R China

[4] Northwestern Polytech Univ, Sch Artificial Intelligence OPt & Elect iOPEN, Xian 710072, Peoples R China

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2024 / Vol. 33

Keywords:

Image-text retrieval; semantic enhancement; momentum contrast; dynamic queue; TRANSFORMER;

DOI:

10.1109/TIP.2023.3348297

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

As a fundamental and challenging task in bridging language and vision domains, Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality, and its key challenge is to measure the semantic similarity across different modalities. Although significant progress has been achieved, existing approaches typically suffer from two major limitations: (1) It hurts the accuracy of the representation by directly exploiting the bottom-up attention based region-level features where each region is equally treated. (2) It limits the scale of negative sample pairs by employing the mini-batch based end-to-end training mechanism. To address these limitations, we propose a Unified Semantic Enhancement Momentum Contrastive Learning (USER) method for ITR. Specifically, we delicately design two simple but effective Global representation based Semantic Enhancement (GSE) modules. One learns the global representation via the self-attention algorithm, noted as Self-Guided Enhancement (SGE) module. The other module benefits from the pre-trained CLIP module, which provides a novel scheme to exploit and transfer the knowledge from an off-the-shelf model, noted as CLIP-Guided Enhancement (CGE) module. Moreover, we incorporate the training mechanism of MoCo into ITR, in which two dynamic queues are employed to enrich and enlarge the scale of negative sample pairs. Meanwhile, a Unified Training Objective (UTO) is developed to learn from mini-batch based and dynamic queue based samples. Extensive experiments on the benchmark MSCOCO and Flickr30K datasets demonstrate the superiority of both retrieval accuracy and inference efficiency. For instance, compared with the existing best method NAAF, the metric R@1 of our USER on the MSCOCO 5K Testing set is improved by 5% and 2.4% on caption retrieval and image retrieval without any external knowledge or pre-trained model while enjoying over 60 times faster inference speed. 
Our source code will be released at https://github.com/zhangy0822/USER.
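
The queue-based training idea in the abstract can be illustrated with a small NumPy sketch. It is a generic MoCo-style illustration, not the authors' implementation: the names `FeatureQueue`, `momentum_update`, and `queue_contrastive_loss`, the InfoNCE loss form, the momentum value, and the temperature are all assumptions made for this example. It shows only one retrieval direction with a single queue, whereas the abstract describes two dynamic queues and a Unified Training Objective whose exact form is not given here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder parameter slowly toward its query-encoder twin."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class FeatureQueue:
    """Fixed-size FIFO buffer of key embeddings used as extra negatives."""

    def __init__(self, capacity, dim):
        self.buffer = np.zeros((capacity, dim))
        self.capacity = capacity
        self.size = 0  # number of valid rows currently stored
        self.ptr = 0   # next row to overwrite (oldest entry once full)

    def enqueue(self, keys):
        for row in keys:
            self.buffer[self.ptr] = row
            self.ptr = (self.ptr + 1) % self.capacity
            self.size = min(self.size + 1, self.capacity)

    def negatives(self):
        return self.buffer[: self.size]


def queue_contrastive_loss(queries, keys, queue_negatives, temperature=0.07):
    """InfoNCE-style loss (an assumed form): the positive for query i is key i;
    negatives are the other in-batch keys plus every embedding in the queue."""
    q = l2_normalize(queries)
    candidates = l2_normalize(np.vstack([keys, queue_negatives]))
    logits = q @ candidates.T / temperature  # shape: (batch, batch + queue)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return -log_prob[idx, idx].mean()


# Toy usage with random stand-ins for image and text embeddings.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
text_keys = image_emb + 0.01 * rng.normal(size=(4, 8))

queue = FeatureQueue(capacity=6, dim=8)
queue.enqueue(rng.normal(size=(5, 8)))  # embeddings from earlier batches
loss = queue_contrastive_loss(image_emb, text_keys, queue.negatives())
queue.enqueue(text_keys)  # newest keys replace the oldest entries
```

Because the queue is decoupled from the mini-batch, the number of negatives per query is set by the queue capacity rather than by the batch size, which is the limitation the abstract says the dynamic queues are meant to address.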

Citation

Pages: 595-609

Number of pages: 15

50 items in total

[31] Characterization and classification of semantic image-text relations
Otto, Christian
Springstein, Matthias
Anand, Avishek
Ewerth, Ralph
INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2020, 9 (01) : 31 - 45
[32] Visual Semantic Reasoning for Image-Text Matching
Li, Kunpeng
Zhang, Yulun
Li, Kai
Li, Yuanyuan
Fu, Yun
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 4653 - 4661
[33] IMAGE-TEXT MATCHING WITH SHARED SEMANTIC CONCEPTS
Miao Lanxin
2022 19TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING (ICCWAMTIP), 2022,
[34] Text-to-Image Generation Method Based on Image-Text Semantic Consistency
Xue Z.
Xu Z.
Lang C.
Feng S.
Wang T.
Li Y.
Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2023, 60 (09) : 2180 - 2190
[35] A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing
Cheng, Qimin
Zhou, Yuzhuo
Fu, Peng
Xu, Yuan
Zhang, Liang
IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2021, 14 : 4284 - 4297
[36] Dynamic Modality Interaction Modeling for Image-Text Retrieval
Qu, Leigang
Liu, Meng
Wu, Jianlong
Gao, Zan
Nie, Liqiang
SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1104 - 1113
[37] External Knowledge Dynamic Modeling for Image-text Retrieval
Yang, Song
Li, Qiang
Li, Wenhui
Liu, Min
Li, Xuanya
Liu, Anan
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5330 - 5338
[38] Asymmetric bi-encoder for image-text retrieval
Xiong, Wei
Liu, Haoliang
Mi, Siya
Zhang, Yu
MULTIMEDIA SYSTEMS, 2023, 29 (06) : 3805 - 3818
[39] Multiview adaptive attention pooling for image-text retrieval
Ding, Yunlai
Yu, Jiaao
Lv, Qingxuan
Zhao, Haoran
Dong, Junyu
Li, Yuezun
KNOWLEDGE-BASED SYSTEMS, 2024, 291
[40] RELATION-GUIDED NETWORK FOR IMAGE-TEXT RETRIEVAL
Yang, Yulou
Shen, Hao
Yang, Ming
2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 1856 - 1860
