Multi-level network based on transformer encoder for fine-grained image-text matching

Cited by: 2
Authors
Yang, Lei [1]
Feng, Yong [1]
Zhou, Mingliang [1]
Xiong, Xiancai [2, 3]
Wang, Yongheng [4]
Qiang, Baohua [5]
Affiliations
[1] Chongqing Univ, Coll Comp Sci, Chongqing 400044, Peoples R China
[2] Minist Nat Resources, Key Lab Monitoring Evaluat & Early Warning Terr Sp, Chongqing 401147, Peoples R China
[3] Chongqing Inst Planning & Nat Resources Invest & M, Chongqing 401121, Peoples R China
[4] Zhejiang Lab, Hangzhou 311121, Peoples R China
[5] Guilin Univ Elect Technol, Guangxi Key Lab Trusted Software, Guilin 541004, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image-text matching; Multi-level network; Transformer encoder; Fine-grained information; INFORMATION;
DOI
10.1007/s00530-023-01079-w
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Image-text matching is important for understanding both vision and language. Existing methods use the cross-attention mechanism to explore deep semantic information. However, most of these methods need to perform two types of alignment, which is extremely time-consuming. In addition, current methods do not consider the digital information within the image or text, which may reduce retrieval performance. In this paper, we propose a multi-level network based on the transformer encoder for fine-grained image-text matching. First, we use the transformer encoder to extract intra-modality relations within the image and text and perform the alignment through an efficient aggregating method, which makes the alignment more efficient and fully utilizes the intra-modality information. Second, we capture the discriminative digital information within the image and text to make the representation more distinguishable. Finally, we utilize the global information of the image and text as complementary information to enhance the representation. According to our experimental results, the method achieves significant improvements on retrieval tasks and in runtime compared with state-of-the-art algorithms. The source code is available at https://github.com/CQULab/MNTE.
Pages: 1981-1994
Number of pages: 14
Related papers
50 in total
  • [1] Multi-level network based on transformer encoder for fine-grained image–text matching
    Lei Yang
    Yong Feng
    Mingliang Zhou
    Xiancai Xiong
    Yongheng Wang
    Baohua Qiang
    [J]. Multimedia Systems, 2023, 29 : 1981 - 1994
  • [2] Fine-grained Image-text Matching by Cross-modal Hard Aligning Network
    Pan, Zhengxin
    Wu, Fangyu
    Zhang, Bailing
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19275 - 19284
  • [3] Multi-level Symmetric Semantic Alignment Network for image-text matching
    Wang, Wenzhuang
    Di, Xiaoguang
    Liu, Maozhen
    Gao, Feng
    [J]. NEUROCOMPUTING, 2024, 599
  • [4] Multi-Level Region Matching for Fine-Grained Sketch-Based Image Retrieval
    Ling, Zhixin
    Xing, Zhen
    Li, Jiangtong
    Niu, Li
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022,
  • [5] Fine-Grained Bidirectional Attention-Based Generative Networks for Image-Text Matching
    Li, Zhixin
    Zhu, Jianwei
    Wei, Jiahui
    Zeng, Yufei
    [J]. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2022, PT III, 2023, 13715 : 390 - 406
  • [6] Image-text matching algorithm based on multi-level semantic alignment
    Li Y.
    Yao T.
    Zhang L.
    Sun Y.
    Fu H.
    [J]. Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics, 2024, 50 (02): : 551 - 558
  • [7] Fine-grained Image Caption based on Multi-level Attention
    Yang Zhenyu
    Zhang Jiao
    [J]. PROCEEDINGS OF 2019 IEEE 7TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND NETWORK TECHNOLOGY (ICCSNT 2019), 2019, : 72 - 78
  • [8] Multi-level information fusion Transformer with background filter for fine-grained image recognition
    Yu, Ying
    Wang, Jinghui
    Pedrycz, Witold
    Miao, Duoqian
    Qian, Jin
    [J]. APPLIED INTELLIGENCE, 2024, 54 (17-18) : 8108 - 8119
  • [9] TECMH: Transformer-Based Cross-Modal Hashing For Fine-Grained Image-Text Retrieval
    Li, Qiqi
    Ma, Longfei
    Jiang, Zheng
    Li, Mingyong
    Jin, Bo
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 75 (02): : 3713 - 3728
  • [10] Learning Relationship-Enhanced Semantic Graph for Fine-Grained Image-Text Matching
    Liu, Xin
    He, Yi
    Cheung, Yiu-Ming
    Xu, Xing
    Wang, Nannan
    [J]. IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54 (02) : 948 - 961