Cross-modal multi-relationship aware reasoning for image-text matching

Cited by: 2
Authors
Zhang, Jin [1 ]
He, Xiaohai [1 ]
Qing, Linbo [1 ]
Liu, Luping [1 ]
Luo, Xiaodong [1 ]
Affiliations
[1] Sichuan Univ, Coll Elect & Informat Engn, Chengdu 610064, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image-text matching; Visual multi-relationship; Graph neural network; Cross-modal retrieval; LANGUAGE; NETWORK;
DOI
10.1007/s11042-020-10466-8
Chinese Library Classification (CLC)
TP [Automation technology; computer technology];
Discipline code
0812 ;
Abstract
Cross-modal image-text matching has attracted considerable interest in both the computer vision and natural language processing communities. The core problem of image-text matching is learning compact cross-modal representations and the correlation between image and text representations. However, the task poses two major challenges. First, current image representation methods focus on semantic information and disregard the spatial position relations between image regions. Second, most existing methods pay little attention to improving the textual representation, which plays a significant role in image-text matching. To address these issues, we designed a decipherable cross-modal multi-relationship aware reasoning network (CMRN) for image-text matching. In particular, a new method is proposed to extract multiple relationships and to learn the correlations between image regions, covering two kinds of visual relations: geometric position relations and semantic interactions. In addition, images are processed as graphs, and a novel spatial relation encoder performs reasoning on these graphs using a graph convolutional network (GCN) with an attention mechanism. A contextual text encoder based on Bidirectional Encoder Representations from Transformers (BERT) is then adopted to learn distinctive textual representations. To verify the effectiveness of the proposed model, extensive experiments were conducted on two public datasets, MSCOCO and Flickr30K. The experimental results show that CMRN achieves superior performance compared with state-of-the-art methods. On Flickr30K, the proposed method outperforms state-of-the-art methods by more than 7.4% in text retrieval from an image query and by 5.0% in image retrieval from a text query (based on Recall@1). On MSCOCO, performance reaches 73.9% for text retrieval and 60.4% for image retrieval (based on Recall@1).
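The abstract describes reasoning over image regions with an attention-weighted graph convolution, then scoring image-text pairs by similarity. The paper's exact architecture is not given here, so the following is only a minimal NumPy sketch of the general idea: region features form graph nodes, a soft adjacency matrix is computed by dot-product attention between regions, one GCN-style aggregation step updates the node features, and a pooled image vector is compared with a text embedding by cosine similarity. All function and variable names (`attention_gcn_layer`, `match_score`, the feature dimension `d`) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_gcn_layer(X, W, Wq, Wk):
    """One graph-convolution step whose edge weights come from
    scaled dot-product attention between region features.

    X : (n_regions, d) region features (graph nodes)
    W, Wq, Wk : (d, d) learnable projections (random here)
    """
    Q, K = X @ Wq, X @ Wk
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # soft adjacency over regions
    return np.maximum(A @ X @ W, 0.0)           # aggregate neighbors + ReLU

def match_score(img_regions, txt_vec):
    """Cosine similarity between the mean-pooled image graph and a text vector."""
    v = img_regions.mean(axis=0)
    return float(v @ txt_vec / (np.linalg.norm(v) * np.linalg.norm(txt_vec) + 1e-8))

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                      # 5 detected region features
W, Wq, Wk = (rng.normal(size=(d, d)) for _ in range(3))
H = attention_gcn_layer(X, W, Wq, Wk)            # relation-aware region features
txt = rng.normal(size=(d,))                      # stand-in for a BERT sentence vector
score = match_score(H, txt)
print(H.shape)  # (5, 8)
```

In the full model the text side would come from a BERT encoder and the projections would be trained with a ranking loss over matched and mismatched pairs; this sketch only illustrates the attention-weighted graph reasoning step.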
Pages: 12005-12027 (23 pages)
Related papers
50 records
  • [11] Cross-modal Semantic Interference Suppression for image-text matching
    Yao, Tao
    Peng, Shouyong
    Sun, Yujuan
    Sheng, Guorui
    Fu, Haiyan
    Kong, Xiangwei
    Engineering Applications of Artificial Intelligence, 2024, 133
  • [12] Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching
    Zhang, Huatian
    Mao, Zhendong
    Zhang, Kun
    Zhang, Yongdong
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3262 - 3270
  • [13] Improving Image-Text Matching With Bidirectional Consistency of Cross-Modal Alignment
    Li, Zhe
    Zhang, Lei
    Zhang, Kun
    Zhang, Yongdong
    Mao, Zhendong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 6590 - 6607
  • [14] Visual Contextual Semantic Reasoning for Cross-Modal Drone Image-Text Retrieval
    Huang, Jinghao
    Chen, Yaxiong
    Xiong, Shengwu
    Lu, Xiaoqiang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [15] MiC: Image-text Matching in Circles with cross-modal generative knowledge enhancement
    Pu, Xiao
    Chen, Yuwen
    Yuan, Lin
    Zhang, Yan
    Li, Hongbo
    Jing, Liping
    Gao, Xinbo
    KNOWLEDGE-BASED SYSTEMS, 2024, 289
  • [16] Cross-Modal Image-Text Matching via Coupled Projection Learning Hashing
    Zhao, Huan
    Wang, Haoqian
    Zha, Xupeng
    Wang, Song
    2022 IEEE 9TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA), 2022, : 367 - 376
  • [17] Cross-modal Image-Text Retrieval with Multitask Learning
    Luo, Junyu
    Shen, Ying
    Ao, Xiang
    Zhao, Zhou
    Yang, Min
    PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM '19), 2019, : 2309 - 2312
  • [18] Rethinking Benchmarks for Cross-modal Image-text Retrieval
    Chen, Weijing
    Yao, Linli
    Jin, Qin
    PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 1241 - 1251
  • [19] Adaptive Cross-Modal Embeddings for Image-Text Alignment
    Wehrmann, Jonatas
    Kolling, Camila
    Barros, Rodrigo C.
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 12313 - 12320
  • [20] Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency for Image-Text Matching
    Zeng, Pengpeng
    Gao, Lianli
    Lyu, Xinyu
    Jing, Shuaiqi
    Song, Jingkuan
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2205 - 2213