Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

Cited by: 31
Authors
Zhang, Kun [1 ]
Mao, Zhendong [1 ]
Liu, An-An [3 ]
Zhang, Yongdong [1 ,2 ]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Anhui, Peoples R China
[2] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Anhui, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
Keywords
Semantics; Optimization; Visualization; Training; Task analysis; Representation learning; Correlation; Image-text matching; attention network; unified adaptive relevance distinguishable learning;
DOI
10.1109/TMM.2022.3141603
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812 ;
Abstract
Image-text matching, as a fundamental cross-modal task, bridges the gap between vision and language. Its core is to accurately learn semantic alignment, i.e., to find the relevant shared semantics between an image and a text. Existing methods typically attend to all fragments whose word-region similarity exceeds an empirical threshold of zero as relevant shared semantics, e.g., via a ReLU operation that forces negative similarities to zero and keeps positive ones. However, this fixed threshold is entirely isolated from feature learning, so it cannot adaptively and accurately distinguish the varying distributions of relevant and irrelevant word-region similarities during training, inevitably limiting semantic alignment learning. To solve this issue, we propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism that incorporates the relevance threshold into a unified learning framework, maximally separating the relevant and irrelevant distributions to obtain better semantic alignment. Specifically, our method adaptively learns the optimal relevance boundary between these two distributions, driving the model to learn more discriminative features. The explicit relevance threshold is integrated directly into similarity matching, which kills two birds with one stone: (1) it excludes the disturbance of irrelevant fragment content, precisely aggregating the relevant shared semantics and boosting matching accuracy, and (2) it avoids computing irrelevant fragment queries, reducing retrieval time. Experimental results on benchmarks show that UARDA substantially and consistently outperforms state-of-the-art methods, with relative rSum improvements of 2%-4% (16.9%-35.3% over the SCAN baseline), while reducing retrieval time by 50%-73%.
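The thresholded attention idea described in the abstract can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the authors' released code: the module name `ThresholdedCrossAttention` and the choice of a single learnable scalar threshold are assumptions made for the sketch (the paper learns the relevance boundary jointly within its unified framework). Setting the threshold to zero recovers the plain ReLU clipping the abstract criticizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThresholdedCrossAttention(nn.Module):
    """Hypothetical sketch of relevance-thresholded word-region attention.

    Word-region similarities below a learnable relevance boundary are
    zeroed out, instead of the usual ReLU-at-zero clipping, so only
    fragments judged relevant contribute to the attended semantics.
    """

    def __init__(self):
        super().__init__()
        # Learnable relevance boundary; initializing at 0 recovers
        # the fixed ReLU threshold used by prior methods such as SCAN.
        self.threshold = nn.Parameter(torch.zeros(1))

    def forward(self, words, regions):
        # words: (n_words, dim) text features; regions: (n_regions, dim)
        # image features. Cosine similarity via L2-normalized dot product.
        words = F.normalize(words, dim=-1)
        regions = F.normalize(regions, dim=-1)
        sim = words @ regions.t()                       # (n_words, n_regions)

        # Shift by the learned threshold, then clip: regions whose
        # similarity falls below the boundary contribute nothing, so
        # their queries could be skipped entirely at retrieval time.
        rel = F.relu(sim - self.threshold)

        # Normalize surviving relevances into attention weights and
        # aggregate the relevant shared semantics per word.
        attn = rel / (rel.sum(dim=-1, keepdim=True) + 1e-8)
        attended = attn @ regions                       # (n_words, dim)
        return attended, rel
```

Because the threshold is an `nn.Parameter`, it receives gradients from the matching loss and shifts with the evolving similarity distributions, which is the adaptive behavior the abstract contrasts with a fixed zero threshold.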
Pages: 1320 - 1332
Number of pages: 13
Related Papers
50 items in total
  • [21] Bi-Attention enhanced representation learning for image-text matching
    Tian, Yumin
    Ding, Aqiang
    Wang, Di
    Luo, Xuemei
    Wan, Bo
    Wang, Yifeng
    PATTERN RECOGNITION, 2023, 140
  • [22] Globally Guided Confidence Enhancement Network for Image-Text Matching
    Dai, Xin
    Tuerhong, Gulanbaier
    Wushouer, Mairidan
    APPLIED SCIENCES-BASEL, 2023, 13 (09):
  • [23] Learning Fragment Self-Attention Embeddings for Image-Text Matching
    Wu, Yiling
    Wang, Shuhui
    Song, Guoli
    Huang, Qingming
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 2088 - 2096
  • [24] Cross-Modal Attention With Semantic Consistence for Image-Text Matching
    Xu, Xing
    Wang, Tan
    Yang, Yang
    Zuo, Lin
    Shen, Fumin
    Shen, Heng Tao
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2020, 31 (12) : 5412 - 5425
  • [25] Learning Dual Semantic Relations With Graph Attention for Image-Text Matching
    Wen, Keyu
    Gu, Xiaodong
    Cheng, Qingrong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (07) : 2866 - 2879
  • [26] Self-attention guided representation learning for image-text matching
    Qi, Xuefei
    Zhang, Ying
    Qi, Jinqing
    Lu, Huchuan
    NEUROCOMPUTING, 2021, 450 : 143 - 155
  • [27] A Mutually Textual and Visual Refinement Network for Image-Text Matching
    Pang, Shanmin
    Zeng, Yueyang
    Zhao, Jiawei
    Xue, Jianru
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 7555 - 7566
  • [28] Region Reinforcement Network With Topic Constraint for Image-Text Matching
    Wu, Jie
    Wu, Chunlei
    Lu, Jing
    Wang, Leiquan
    Cui, Xuerong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01) : 388 - 397
  • [29] A Multiview Text Imagination Network Based on Latent Alignment for Image-Text Matching
    Shang, Heng
    Zhao, Guoshuai
    Shi, Jing
    Qian, Xueming
    IEEE INTELLIGENT SYSTEMS, 2023, 38 (03) : 41 - 50
  • [30] Team HUGE: Image-Text Matching via Hierarchical and Unified Graph Enhancing
    Li, Bo
    Wu, You
    Li, Zhixin
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 704 - 712