Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

Citations: 31
Authors
Zhang, Kun [1 ]
Mao, Zhendong [1 ]
Liu, An-An [3 ]
Zhang, Yongdong [1 ,2 ]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Anhui, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Anhui, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
Keywords
Semantics; Optimization; Visualization; Training; Task analysis; Representation learning; Correlation; Image-text matching; attention network; unified adaptive relevance distinguishable learning;
DOI
10.1109/TMM.2022.3141603
CLC Classification Number
TP [Automation and Computer Technology]
Discipline Code
0812
Abstract
Image-text matching, as a fundamental cross-modal task, bridges the gap between vision and language. Its core is to accurately learn semantic alignment, i.e., to find the relevant shared semantics between image and text. Existing methods typically attend to all fragments whose word-region similarity exceeds an empirical threshold of zero as relevant shared semantics, e.g., via a ReLU operation that forces negative similarities to zero and retains positive ones. However, this fixed threshold is entirely decoupled from feature learning, so it cannot adaptively and accurately distinguish the varying distributions of relevant and irrelevant word-region similarities during training, which inevitably limits semantic alignment learning. To solve this issue, we propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism that incorporates the relevance threshold into a unified learning framework, maximally separating the relevant and irrelevant distributions to obtain better semantic alignment. Specifically, our method adaptively learns the optimal relevance boundary between these two distributions, driving the model to learn more discriminative features. The explicit relevance threshold is integrated directly into similarity matching, which serves two purposes at once: (1) excluding the disturbance of irrelevant fragment content, so that precisely the relevant shared semantics are aggregated, boosting matching accuracy; and (2) avoiding the computation of irrelevant fragment queries, reducing retrieval time. Experimental results on benchmarks show that UARDA substantially and consistently outperforms state-of-the-art methods, with relative rSum improvements of 2%-4% (16.9%-35.3% over the SCAN baseline), while reducing retrieval time by 50%-73%.
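The mechanism described in the abstract can be illustrated with a minimal PyTorch sketch (not the authors' released code; the module name, the `smooth` temperature, and the score aggregation are illustrative assumptions): the fixed zero threshold that a ReLU applies to word-region similarities is replaced by a learnable boundary tau, so the relevance cut-off is optimized jointly with feature learning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveRelevanceAttention(nn.Module):
    """Illustrative sketch: cross attention with a learnable relevance
    threshold, in the spirit of UARDA. Baselines such as SCAN keep all
    word-region pairs with similarity above a fixed threshold of zero
    (relu(sim)); here the boundary `tau` is a parameter trained jointly
    with the features. Names are assumptions, not the authors' code."""

    def __init__(self, init_tau: float = 0.0, smooth: float = 9.0):
        super().__init__()
        # learnable relevance boundary between the relevant and
        # irrelevant word-region similarity distributions
        self.tau = nn.Parameter(torch.tensor(init_tau))
        self.smooth = smooth  # softmax temperature, a common SCAN-style setting

    def forward(self, words: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # words:   (n_words, d)   L2-normalized textual fragment features
        # regions: (n_regions, d) L2-normalized visual fragment features
        sim = words @ regions.t()             # cosine similarities (n_words, n_regions)
        rel = F.relu(sim - self.tau)          # pairs at or below tau are clamped to zero
        attn = F.softmax(self.smooth * rel, dim=1)  # clamped pairs keep only the base weight
        attended = attn @ regions             # relevant region context per word (n_words, d)
        # aggregate word-level relevance into one image-text score
        return F.cosine_similarity(words, attended, dim=1).mean()

# usage sketch: score = AdaptiveRelevanceAttention()(word_feats, region_feats)
```

Because tau enters through relu(sim - tau), it receives gradients from every surviving word-region pair, which is what allows the boundary to adapt to the two similarity distributions during training rather than staying fixed at zero.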
Pages: 1320-1332
Page count: 13
Related Papers
50 records in total
  • [1] Position Focused Attention Network for Image-Text Matching
    Wang, Yaxiong
    Yang, Hao
    Qian, Xueming
    Ma, Lin
    Lu, Jing
    Li, Biao
    Fan, Xin
    [J]. PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 3792 - 3798
  • [2] Dual Semantic Relationship Attention Network for Image-Text Matching
    Wen, Keyu
    Gu, Xiaodong
    [J]. 2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [3] Rare-aware attention network for image-text matching
    Wang, Yan
    Su, Yuting
    Li, Wenhui
    Sun, Zhengya
    Wei, Zhiqiang
    Nie, Jie
    Li, Xuanya
    Liu, An-An
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (03)
  • [4] Cross Attention Graph Matching Network for Image-Text Retrieval
    Yang, Xiaoyu
    Xie, Hao
    Mao, Junyi
    Wang, Zhiguo
    Yin, Guangqiang
    [J]. PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND NETWORKS, VOL II, CENET 2023, 2024, 1126 : 274 - 286
  • [5] Reference-Aware Adaptive Network for Image-Text Matching
    Xiong G.
    Meng M.
    Zhang T.
    Zhang D.
    Zhang Y.
[J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (10) : 1 - 1
  • [6] Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching
    Liu, Chunxiao
    Mao, Zhendong
    Liu, An-An
    Zhang, Tianzhu
    Wang, Bin
    Zhang, Yongdong
    [J]. PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 3 - 11
  • [7] Global-Guided Asymmetric Attention Network for Image-Text Matching
    Wu, Dongqing
    Li, Huihui
    Tang, Yinge
    Guo, Lei
    Liu, Hang
    [J]. NEUROCOMPUTING, 2022, 481 : 77 - 90
  • [8] Fusion layer attention for image-text matching
    Wang, Depeng
    Wang, Liejun
    Song, Shiji
    Huang, Gao
    Guo, Yuchen
    Cheng, Shuli
    Ao, Naixiang
    Du, Anyu
    [J]. NEUROCOMPUTING, 2021, 442 : 249 - 259
  • [9] Stacked Cross Attention for Image-Text Matching
    Lee, Kuang-Huei
    Chen, Xi
    Hua, Gang
    Hu, Houdong
    He, Xiaodong
    [J]. COMPUTER VISION - ECCV 2018, PT IV, 2018, 11208 : 212 - 228