ESA: External Space Attention Aggregation for Image-Text Retrieval

Cited by: 12

Authors:

Zhu, Hongguang [1,2]

Zhang, Chunjie [1,2]

Wei, Yunchao [1,2,3]

Huang, Shujuan [1,2]

Zhao, Yao [1,2,3]

Affiliations:

[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China

[2] Beijing Key Lab Adv Informat Sci & Network Technol, Beijing 100044, Peoples R China

[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2023, Vol. 33, No. 10

Funding:

National Natural Science Foundation of China; Beijing Natural Science Foundation;

Keywords:

Image-text retrieval; visual-semantic embedding;

DOI:

10.1109/TCSVT.2023.3253548

Chinese Library Classification (CLC):

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Due to the large gap between the vision and language modalities, effective and efficient image-text retrieval remains an unsolved problem. Recent work unilaterally pursues retrieval accuracy, either through entangled image-text interaction or through brute-force large-scale vision-language pre-training. However, the former often leads to an unacceptable explosion in retrieval time when deployed on large-scale databases. The latter relies heavily on extra corpora to learn better alignment in the feature space while obscuring the contribution of the network architecture. In this work, we investigate a trade-off that balances effectiveness and efficiency. To this end, on the premise of efficient retrieval, we propose the plug-and-play External Space attention Aggregation (ESA) module, which enables element-wise fusion of modal features under spatial dimensional attention. Based on flexible spatial awareness, we further propose the Self-Expanding triplet Loss (SEL) to expand the representation space of samples and optimize the alignment of the embedding space. Extensive experiments demonstrate the effectiveness of our method on two benchmark datasets. With identical visual and textual backbones, our single model outperforms the ensemble models of similar methods, and our ensemble model further extends the advantage. Meanwhile, compared with a vision-language pre-training embedding-based method that used 83x as many image-text pairs as ours, our approach not only achieves better performance but is also 3x faster at retrieval.


Pages: 6131-6143

Number of pages: 13

50 entries in total

[31] Image-text Retrieval via Preserving Main Semantics of Vision
Zhang, Xu
Niu, Xinzheng
Fournier-Viger, Philippe
Dai, Xudong
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023: 1967 - 1972
[32] TFUN: Trilinear Fusion Network for Ternary Image-Text Retrieval
Xu, Xing
Sun, Jialiang
Cao, Zuo
Zhang, Yin
Zhu, Xiaofeng
Shen, Heng Tao
INFORMATION FUSION, 2023, 91 : 327 - 337
[33] Dual Stream Relation Learning Network for Image-Text Retrieval
Wu, Dongqing
Li, Huihui
Gu, Cang
Guo, Lei
Liu, Hang
IEEE TRANSACTIONS ON MULTIMEDIA, 2025, 27 : 1551 - 1565
[34] Dissecting Deep Metric Learning Losses for Image-Text Retrieval
Xuan, Hong
Chen, Xi
2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023: 2163 - 2172
[35] Cross-modal Image-Text Retrieval with Multitask Learning
Luo, Junyu
Shen, Ying
Ao, Xiang
Zhao, Zhou
Yang, Min
PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19), 2019: 2309 - 2312
[36] Learning to Embed Semantic Similarity for Joint Image-Text Retrieval
Malali, Noam
Keller, Yosi
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (12) : 10252 - 10260
[37] Multi-level similarity learning for image-text retrieval
Li, Wen-Hui
Yang, Song
Wang, Yan
Song, Dan
Li, Xuan-Ya
INFORMATION PROCESSING & MANAGEMENT, 2021, 58 (01)
[38] Text-Guided Knowledge Transfer for Remote Sensing Image-Text Retrieval
Liu, An-An
Yang, Bo
Li, Wenhui
Song, Dan
Sun, Zhengya
Ren, Tongwei
Wei, Zhiqiang
IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21 : 1 - 5
[39] Scene Graph based Fusion Network for Image-Text Retrieval
Wang, Guoliang
Shang, Yanlei
Chen, Yong
Zhen, Chaoqi
Cheng, Dequan
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023: 138 - 143
[40] HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval
Guo, Jie
Wang, Meiting
Zhou, Yan
Song, Bin
Chi, Yuhao
Fan, Wei
Chang, Jianglong
IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9189 - 9202
