ESA: External Space Attention Aggregation for Image-Text Retrieval

Cited by: 12
Authors
Zhu, Hongguang [1, 2]
Zhang, Chunjie [1, 2]
Wei, Yunchao [1, 2, 3]
Huang, Shujuan [1, 2]
Zhao, Yao [1, 2, 3]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Key Lab Adv Informat Sci & Network Technol, Beijing 100044, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
Image-text retrieval; visual-semantic embedding;
DOI
10.1109/TCSVT.2023.3253548
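Chinese Library Classification number
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification code
0808 ; 0809 ;
Abstract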
Because of the large gap between the vision and language modalities, effective and efficient image-text retrieval remains an unsolved problem. Recent work has focused on unilaterally pursuing retrieval accuracy, either through entangled image-text interaction or through brute-force large-scale vision-language pre-training. However, the former often leads to an unacceptable explosion in retrieval time when deployed on large-scale databases. The latter relies heavily on extra corpora to learn better alignment in the feature space, while obscuring the contribution of the network architecture. In this work, we investigate a trade-off that balances effectiveness and efficiency. To this end, on the premise of efficient retrieval, we propose the plug-and-play External Space attention Aggregation (ESA) module, which enables element-wise fusion of modal features under spatial dimensional attention. Based on flexible spatial awareness, we further propose the Self-Expanding triplet Loss (SEL) to expand the representation space of samples and optimize the alignment of the embedding space. Extensive experiments demonstrate the effectiveness of our method on two benchmark datasets. With identical visual and textual backbones, our single model outperforms the ensemble models of similar methods, and our ensemble model further extends this advantage. Meanwhile, compared with a vision-language pre-training embedding-based method that used 83x more image-text pairs than ours, our approach not only achieves better performance but is also 3x faster in retrieval time.
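The abstract gives no implementation details, so the following is only a minimal sketch of what "element-wise" attention aggregation in an embedding-based retrieval pipeline might look like; it is not the authors' ESA module. The function names (`elementwise_attention_pool`, `rank`) are hypothetical, and the attention scores, which would normally come from a learned layer, are passed in directly as an assumption. The idea illustrated is that each feature dimension gets its own softmax weights over the local features, rather than one scalar weight per feature, and that retrieval then reduces to comparing precomputed embeddings.

```python
import math


def softmax(values):
    # Numerically stable softmax over a list of floats.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def elementwise_attention_pool(features, scores):
    """Pool N local features (N x D) into one D-dimensional embedding.

    `scores` is an N x D matrix of attention logits (assumed given here;
    in a real model they would be predicted by a learned layer).
    The softmax runs over the N features separately for every dimension,
    so each dimension can attend to a different local feature.
    """
    n, d = len(features), len(features[0])
    pooled = []
    for j in range(d):
        weights = softmax([scores[i][j] for i in range(n)])
        pooled.append(sum(weights[i] * features[i][j] for i in range(n)))
    return pooled


def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query, gallery):
    # Indices of gallery embeddings, most similar to the query first.
    return sorted(range(len(gallery)),
                  key=lambda i: cosine(query, gallery[i]),
                  reverse=True)


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0]]
    uniform = [[0.0, 0.0], [0.0, 0.0]]   # equal logits reduce to mean pooling
    print(elementwise_attention_pool(feats, uniform))
    print(rank([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1]]))
```

Because each image or text is reduced to a single embedding independently of the other modality, gallery embeddings can be computed once and reused for every query, which is the efficiency property the abstract contrasts with entangled image-text interaction.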
Pages: 6131 - 6143
Number of pages: 13
Related papers
50 in total
  • [31] Image-text Retrieval via Preserving Main Semantics of Vision
    Zhang, Xu
    Niu, Xinzheng
    Fournier-Viger, Philippe
    Dai, Xudong
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1967 - 1972
  • [32] TFUN: Trilinear Fusion Network for Ternary Image-Text Retrieval
    Xu, Xing
    Sun, Jialiang
    Cao, Zuo
    Zhang, Yin
    Zhu, Xiaofeng
    Shen, Heng Tao
    INFORMATION FUSION, 2023, 91 : 327 - 337
  • [33] Dual Stream Relation Learning Network for Image-Text Retrieval
    Wu, Dongqing
    Li, Huihui
    Gu, Cang
    Guo, Lei
    Liu, Hang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2025, 27 : 1551 - 1565
  • [34] Dissecting Deep Metric Learning Losses for Image-Text Retrieval
    Xuan, Hong
    Chen, Xi
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 2163 - 2172
  • [35] Cross-modal Image-Text Retrieval with Multitask Learning
    Luo, Junyu
    Shen, Ying
    Ao, Xiang
    Zhao, Zhou
    Yang, Min
    PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19), 2019, : 2309 - 2312
  • [36] Learning to Embed Semantic Similarity for Joint Image-Text Retrieval
    Malali, Noam
    Keller, Yosi
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (12) : 10252 - 10260
  • [37] Multi-level similarity learning for image-text retrieval
    Li, Wen-Hui
    Yang, Song
    Wang, Yan
    Song, Dan
    Li, Xuan-Ya
    INFORMATION PROCESSING & MANAGEMENT, 2021, 58 (01)
  • [38] Text-Guided Knowledge Transfer for Remote Sensing Image-Text Retrieval
    Liu, An-An
    Yang, Bo
    Li, Wenhui
    Song, Dan
    Sun, Zhengya
    Ren, Tongwei
    Wei, Zhiqiang
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2024, 21 : 1 - 5
  • [39] Scene Graph based Fusion Network for Image-Text Retrieval
    Wang, Guoliang
    Shang, Yanlei
    Chen, Yong
    Zhen, Chaoqi
    Cheng, Dequan
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 138 - 143
  • [40] HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval
    Guo, Jie
    Wang, Meiting
    Zhou, Yan
    Song, Bin
    Chi, Yuhao
    Fan, Wei
    Chang, Jianglong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9189 - 9202