Multi-level network based on transformer encoder for fine-grained image-text matching

Cited by: 2

Authors:

Yang, Lei [1]

Feng, Yong [1]

Zhou, Mingliang [1]

Xiong, Xiancai [2,3]

Wang, Yongheng [4]

Qiang, Baohua [5]

Affiliations:

[1] Chongqing Univ, Coll Comp Sci, Chongqing 400044, Peoples R China

[2] Minist Nat Resources, Key Lab Monitoring Evaluat & Early Warning Terr Sp, Chongqing 401147, Peoples R China

[3] Chongqing Inst Planning & Nat Resources Invest & M, Chongqing 401121, Peoples R China

[4] Zhejiang Lab, Hangzhou 311121, Peoples R China

[5] Guilin Univ Elect Technol, Guangxi Key Lab Trusted Software, Guilin 541004, Peoples R China

Source:

MULTIMEDIA SYSTEMS | 2023 / Volume 29 / Issue 04

Funding:

National Natural Science Foundation of China;

Keywords:

Image-text matching; Multi-level network; Transformer encoder; Fine-grained information; INFORMATION;

DOI:

10.1007/s00530-023-01079-w

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812 ;

Abstract:

Enabling image-text matching is important for understanding both vision and language. Existing methods utilize the cross-attention mechanism to explore deep semantic information. However, the majority of these methods need to perform two types of alignment, which is extremely time-consuming. In addition, current methods do not consider the digital information within the image or text, which may lead to a reduction in retrieval performance. In this paper, we propose a multi-level network based on the transformer encoder for fine-grained image-text matching. First, we use the transformer encoder to extract intra-modality relations within the image and text and perform the alignment through an efficient aggregating method, which makes the alignment more efficient and fully utilizes the intra-modality information. Second, we capture the discriminative digital information within the image and text to make the representation more distinguishable. Finally, we utilize the global information of the image and text as complementary information to enhance the representation. According to our experimental results, significant improvements in retrieval performance and runtime can be achieved compared with state-of-the-art algorithms. The source code is available at https://github.com/CQULab/MNTE.
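
The first step described in the abstract (intra-modality relations via self-attention, then alignment on aggregated representations rather than pairwise cross-attention) can be illustrated with a minimal NumPy sketch. This is not the authors' MNTE implementation: the single-head attention layer, the mean pooling, the random weights, and the names `self_attention` and `embed` are all assumptions chosen for illustration, and the digital-information and global-information branches are omitted.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x.

    Hypothetical stand-in for one transformer encoder layer.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return x + weights @ v  # residual connection

def embed(fragments, params):
    """Contextualise fragments within one modality, then aggregate.

    Mean pooling is an assumed aggregation, not the paper's method.
    """
    contextual = self_attention(fragments, *params)
    pooled = contextual.mean(axis=0)
    return pooled / np.linalg.norm(pooled)  # L2-normalise for cosine similarity

rng = np.random.default_rng(0)
dim = 16
# Random, untrained projection weights (illustrative only).
img_params = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3)]
txt_params = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3)]

regions = rng.normal(size=(5, 36, dim))  # 5 images, 36 region features each
words = rng.normal(size=(5, 12, dim))    # 5 captions, 12 word features each

img_emb = np.stack([embed(r, img_params) for r in regions])
txt_emb = np.stack([embed(w, txt_params) for w in words])

# One matrix product scores every image against every caption.
similarity = img_emb @ txt_emb.T
print(similarity.shape)
```

Because each modality is encoded independently, image and text embeddings can be computed once and reused across all pairs, which is the general reason aggregated alignment tends to be cheaper than running cross-attention for every image-caption pair.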

Pages: 1981-1994

Number of pages: 14

50 related records in total

[1] Multi-level network based on transformer encoder for fine-grained image–text matching
Yang, Lei
Feng, Yong
Zhou, Mingliang
Xiong, Xiancai
Wang, Yongheng
Qiang, Baohua
[J]. Multimedia Systems, 2023, 29: 1981-1994
[2] Fine-grained Image-text Matching by Cross-modal Hard Aligning Network
Pan, Zhengxin
Wu, Fangyu
Zhang, Bailing
[J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023: 19275-19284
[3] Multi-level Symmetric Semantic Alignment Network for image-text matching
Wang, Wenzhuang
Di, Xiaoguang
Liu, Maozhen
Gao, Feng
[J]. NEUROCOMPUTING, 2024, 599
[4] Multi-Level Region Matching for Fine-Grained Sketch-Based Image Retrieval
Ling, Zhixin
Xing, Zhen
Li, Jiangtong
Niu, Li
[J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022
[5] Fine-Grained Bidirectional Attention-Based Generative Networks for Image-Text Matching
Li, Zhixin
Zhu, Jianwei
Wei, Jiahui
Zeng, Yufei
[J]. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2022, PT III, 2023, 13715: 390-406
[6] Image-text matching algorithm based on multi-level semantic alignment
Li Y.
Yao T.
Zhang L.
Sun Y.
Fu H.
[J]. Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics, 2024, 50 (02): 551-558
[7] Fine-grained Image Caption based on Multi-level Attention
Yang, Zhenyu
Zhang, Jiao
[J]. PROCEEDINGS OF 2019 IEEE 7TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND NETWORK TECHNOLOGY (ICCSNT 2019), 2019: 72-78
[8] Multi-level information fusion Transformer with background filter for fine-grained image recognition
Yu, Ying
Wang, Jinghui
Pedrycz, Witold
Miao, Duoqian
Qian, Jin
[J]. APPLIED INTELLIGENCE, 2024, 54 (17-18): 8108-8119
[9] TECMH: Transformer-Based Cross-Modal Hashing For Fine-Grained Image-Text Retrieval
Li, Qiqi
Ma, Longfei
Jiang, Zheng
Li, Mingyong
Jin, Bo
[J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 75 (02): 3713-3728
[10] Learning Relationship-Enhanced Semantic Graph for Fine-Grained Image-Text Matching
Liu, Xin
He, Yi
Cheung, Yiu-Ming
Xu, Xing
Wang, Nannan
[J]. IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54 (02): 948-961