Cross-modal multi-relationship aware reasoning for image-text matching

Times cited: 2
Authors
Zhang, Jin [1 ]
He, Xiaohai [1 ]
Qing, Linbo [1 ]
Liu, Luping [1 ]
Luo, Xiaodong [1 ]
Affiliations
[1] Sichuan Univ, Coll Elect & Informat Engn, Chengdu 610064, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image-text matching; Visual multi-relationship; Graph neural network; Cross-modal retrieval; LANGUAGE; NETWORK;
DOI
10.1007/s11042-020-10466-8
CLC classification number
TP [automation and computer technology]
Subject classification code
0812
Abstract
Cross-modal image-text matching has attracted considerable interest in both computer vision and natural language processing communities. The main issue of image-text matching is to learn the compact cross-modal representations and the correlation between image and text representations. However, the image-text matching task has two major challenges. First, the current image representation methods focus on the semantic information and disregard the spatial position relations between image regions. Second, most existing methods pay little attention to improving textual representation which plays a significant role in image-text matching. To address these issues, we designed a decipherable cross-modal multi-relationship aware reasoning network (CMRN) for image-text matching. In particular, a new method is proposed to extract multi-relationship and to learn the correlations between image regions, including two kinds of visual relations: the geometric position relation and semantic interaction. In addition, images are processed as graphs, and a novel spatial relation encoder is introduced to perform reasoning on the graphs by employing a graph convolutional network (GCN) with attention mechanism. Thereafter, a contextual text encoder based on Bidirectional Encoder Representations from Transformers is adopted to learn distinctive textual representations. To verify the effectiveness of the proposed model, extensive experiments were conducted on two public datasets, namely MSCOCO and Flickr30K. The experimental results show that CMRN achieved superior performance when compared with state-of-the-art methods. On Flickr30K, the proposed method outperforms state-of-the-art methods more than 7.4% in text retrieval from image query, and 5.0% relatively in image retrieval with text query (based on Recall@1). On MSCOCO, the performance reaches 73.9% for text retrieval and 60.4% for image retrieval (based on Recall@1).
Pages: 12005-12027
Page count: 23
Related papers (50 total)
  • [32] Heterogeneous Graph Fusion Network for cross-modal image-text retrieval
    Qin, Xueyang; Li, Lishuang; Pang, Guangyao; Hao, Fei
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 249
  • [33] Image-text bidirectional learning network based cross-modal retrieval
    Li, Zhuoyi; Lu, Huibin; Fu, Hao; Gu, Guanghua
    NEUROCOMPUTING, 2022, 483 : 148 - 159
  • [34] Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval
    Lu, Haoyu; Huo, Yuqi; Ding, Mingyu; Fei, Nanyi; Lu, Zhiwu
    MACHINE INTELLIGENCE RESEARCH, 2023, 20 (04) : 569 - 582
  • [35] Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval
    Mithun, Niluthpol Chowdhury; Panda, Rameswar; Papalexakis, Evangelos E.; Roy-Chowdhury, Amit K.
    PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018 : 1856 - 1864
  • [36] RICH: A rapid method for image-text cross-modal hash retrieval
    Li, Bo; Yao, Dan; Li, Zhixin
    DISPLAYS, 2023, 79
  • [37] More Than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching
    Chen, Yuxiao; Yuan, Jianbo; Zhao, Long; Chen, Tianlang; Luo, Rui; Davis, Larry; Metaxas, Dimitris N.
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023 : 4421 - 4429
  • [38] SAM: cross-modal semantic alignments module for image-text retrieval
    Park, Pilseo; Jang, Soojin; Cho, Yunsung; Kim, Youngbin
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (04) : 12363 - 12377
  • [39] An Enhanced Feature Extraction Framework for Cross-Modal Image-Text Retrieval
    Zhang, Jinzhi; Wang, Luyao; Zheng, Fuzhong; Wang, Xu; Zhang, Haisu
    REMOTE SENSING, 2024, 16 (12)
  • [40] Learning Hierarchical Semantic Correspondences for Cross-Modal Image-Text Retrieval
    Zeng, Sheng; Liu, Changhong; Zhou, Jun; Chen, Yong; Jiang, Aiwen; Li, Hanxi
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022 : 239 - 248