ESA: External Space Attention Aggregation for Image-Text Retrieval

Cited by: 12
Authors
Zhu, Hongguang [1, 2]
Zhang, Chunjie [1, 2]
Wei, Yunchao [1, 2, 3]
Huang, Shujuan [1, 2]
Zhao, Yao [1, 2, 3]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Key Lab Adv Informat Sci & Network Technol, Beijing 100044, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
Image-text retrieval; visual-semantic embedding
DOI
10.1109/TCSVT.2023.3253548
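Chinese Library Classification number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract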
Because of the large gap between the vision and language modalities, effective and efficient image-text retrieval remains an unsolved problem. Recent work has pursued retrieval accuracy one-sidedly, either through entangled image-text interaction or through brute-force large-scale vision-language pre-training. However, the former often causes an unacceptable explosion in retrieval time when deployed on large-scale databases. The latter relies heavily on extra corpora to learn better alignment in the feature space while obscuring the contribution of the network architecture. In this work, we investigate a trade-off that balances effectiveness and efficiency. To this end, on the premise of efficient retrieval, we propose the plug-and-play External Space attention Aggregation (ESA) module, which enables element-wise fusion of modal features under spatial dimensional attention. Based on flexible spatial awareness, we further propose the Self-Expanding triplet Loss (SEL) to expand the representation space of samples and optimize the alignment of the embedding space. Extensive experiments on two benchmark datasets demonstrate the effectiveness of our method. With identical visual and textual backbones, our single model outperforms the ensemble models of similar methods, and our ensemble model further widens the advantage. Moreover, compared with a vision-language pre-training embedding-based method that used 83x as many image-text pairs as ours, our approach not only achieves better performance but is also 3x faster in retrieval time.
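The abstract describes aggregating local features element-wise under dimension-level attention, but it gives no formulas. The sketch below is therefore only a generic illustration of that idea and not the ESA module itself. The function names, the `scores` input (one attention logit per feature element, assumed to come from some learned layer), and the toy data are all hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def elementwise_attention_pool(features, scores):
    """Aggregate n local feature vectors (n x d) into one d-dim embedding.

    Assumption for illustration: `scores` (n x d) holds one attention logit
    per feature element. For each dimension j, the logits are normalized
    across the n local features and weight that dimension only.
    """
    n, d = len(features), len(features[0])
    pooled = []
    for j in range(d):
        weights = softmax([scores[i][j] for i in range(n)])
        pooled.append(sum(weights[i] * features[i][j] for i in range(n)))
    return pooled


def cosine(u, v):
    """Cosine similarity between two embeddings of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy example: two local features (e.g. image regions) with two dimensions.
regions = [[1.0, 0.0], [0.0, 1.0]]
uniform_scores = [[0.0, 0.0], [0.0, 0.0]]  # equal logits reduce to mean pooling
image_embedding = elementwise_attention_pool(regions, uniform_scores)
```

In an embedding-based setup of this kind, each image and each text is pooled into a single vector once, so retrieval reduces to comparing precomputed vectors (for example with `cosine`) rather than running a joint image-text interaction for every pair.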
Pages: 6131-6143
Number of pages: 13