Cross-modal multi-relationship aware reasoning for image-text matching

Cited by: 2
Authors
Zhang, Jin [1 ]
He, Xiaohai [1 ]
Qing, Linbo [1 ]
Liu, Luping [1 ]
Luo, Xiaodong [1 ]
Affiliations
[1] Sichuan Univ, Coll Elect & Informat Engn, Chengdu 610064, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image-text matching; Visual multi-relationship; Graph neural network; Cross-modal retrieval; LANGUAGE; NETWORK;
DOI
10.1007/s11042-020-10466-8
CLC number
TP [Automation technology, computer technology];
Discipline code
0812;
Abstract
Cross-modal image-text matching has attracted considerable interest in both the computer vision and natural language processing communities. The main issue in image-text matching is learning compact cross-modal representations and the correlation between image and text representations. However, the task faces two major challenges. First, current image representation methods focus on semantic information and disregard the spatial position relations between image regions. Second, most existing methods pay little attention to improving the textual representation, which plays a significant role in image-text matching. To address these issues, we designed a decipherable cross-modal multi-relationship aware reasoning network (CMRN) for image-text matching. In particular, a new method is proposed to extract multiple relationships and to learn the correlations between image regions, covering two kinds of visual relations: geometric position relations and semantic interactions. In addition, images are processed as graphs, and a novel spatial relation encoder is introduced to perform reasoning on the graphs by employing a graph convolutional network (GCN) with an attention mechanism. Thereafter, a contextual text encoder based on Bidirectional Encoder Representations from Transformers is adopted to learn distinctive textual representations. To verify the effectiveness of the proposed model, extensive experiments were conducted on two public datasets, MSCOCO and Flickr30K. The experimental results show that CMRN achieves superior performance compared with state-of-the-art methods. On Flickr30K, the proposed method outperforms state-of-the-art methods by more than 7.4% in text retrieval from an image query and by 5.0% in image retrieval from a text query (based on Recall@1). On MSCOCO, the performance reaches 73.9% for text retrieval and 60.4% for image retrieval (based on Recall@1).
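The relation-aware reasoning step the abstract describes (attention over a graph of image regions, biased by pairwise geometric/semantic relation scores) can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the authors' implementation; the function name, the projection matrices, and the way the relation prior is injected as an additive attention bias are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_aware_reasoning_step(regions, rel_bias, W, Wq, Wk):
    """One attention-based reasoning step over an image region graph.

    regions:  (N, D) region features (e.g. from an object detector)
    rel_bias: (N, N) pairwise relation scores combining geometric
              position and semantic interaction cues (assumed given)
    W, Wq, Wk: (D, D) learned projections (random placeholders here)
    """
    q = regions @ Wq                                    # queries
    k = regions @ Wk                                    # keys
    # Content-based attention plus the relation prior as an additive bias.
    scores = q @ k.T / np.sqrt(q.shape[1]) + rel_bias
    attn = softmax(scores, axis=-1)                     # rows sum to 1
    return attn @ (regions @ W)                         # relation-weighted aggregation

rng = np.random.default_rng(0)
N, D = 5, 8
regions = rng.normal(size=(N, D))
rel_bias = rng.normal(size=(N, N))
W, Wq, Wk = (rng.normal(size=(D, D)) for _ in range(3))
out = relation_aware_reasoning_step(regions, rel_bias, W, Wq, Wk)
print(out.shape)  # (5, 8)
```

Stacking a few such steps, then pooling the region outputs, would yield the graph-level image embedding to be matched against the BERT-based text embedding.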
Pages: 12005-12027
Page count: 23
Related papers
50 records total
  • [21] Cross-Modal Image-Text Retrieval with Semantic Consistency
    Chen, Hui
    Ding, Guiguang
    Lin, Zijin
    Zhao, Sicheng
    Han, Jungong
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1749 - 1757
  • [22] Fine-grained Image-text Matching by Cross-modal Hard Aligning Network
    Pan, Zhengxin
    Wu, Fangyu
    Zhang, Bailing
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 19275 - 19284
  • [23] Image-Text Embedding with Hierarchical Knowledge for Cross-Modal Retrieval
    Seo, Sanghyun
    Kim, Juntae
    PROCEEDINGS OF 2018 THE 2ND INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE (CSAI 2018) / 2018 THE 10TH INTERNATIONAL CONFERENCE ON INFORMATION AND MULTIMEDIA TECHNOLOGY (ICIMT 2018), 2018, : 350 - 353
  • [24] Image-Text Retrieval With Cross-Modal Semantic Importance Consistency
    Liu, Zejun
    Chen, Fanglin
    Xu, Jun
    Pei, Wenjie
    Lu, Guangming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (05) : 2465 - 2476
  • [25] Joint feature approach for image-text cross-modal retrieval
    Gao, Dihui
    Sheng, Lijie
    Xu, Xiaodong
    Miao, Qiguang
    Xi'an Dianzi Keji Daxue Xuebao/Journal of Xidian University, 2024, 51 (04): : 128 - 138
  • [26] Image-Text Cross-Modal Retrieval with Instance Contrastive Embedding
    Zeng, Ruigeng
    Ma, Wentao
    Wu, Xiaoqian
    Liu, Wei
    Liu, Jie
    ELECTRONICS, 2024, 13 (02)
  • [27] An Image-Text Matching Method for Multi-Modal Robots
    Zheng, Ke
    Li, Zhou
    JOURNAL OF ORGANIZATIONAL AND END USER COMPUTING, 2024, 36 (01)
  • [28] Probability Distribution Representation Learning for Image-Text Cross-Modal Retrieval
    Yang C.
    Liu L.
    Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2022, 34 (05): : 751 - 759
  • [29] MULTI-SCALE INTERACTIVE TRANSFORMER FOR REMOTE SENSING CROSS-MODAL IMAGE-TEXT RETRIEVAL
    Wang, Yijing
    Ma, Jingjing
    Li, Mingteng
    Tang, Xu
    Han, Xiao
    Jiao, Licheng
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 839 - 842
  • [30] SAM: cross-modal semantic alignments module for image-text retrieval
    Pilseo Park
    Soojin Jang
    Yunsung Cho
    Youngbin Kim
    Multimedia Tools and Applications, 2024, 83 : 12363 - 12377