Cross-modal multi-relationship aware reasoning for image-text matching

Cited by: 2
Authors
Zhang, Jin [1 ]
He, Xiaohai [1 ]
Qing, Linbo [1 ]
Liu, Luping [1 ]
Luo, Xiaodong [1 ]
Affiliations
[1] Sichuan Univ, Coll Elect & Informat Engn, Chengdu 610064, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image-text matching; Visual multi-relationship; Graph neural network; Cross-modal retrieval; LANGUAGE; NETWORK;
DOI
10.1007/s11042-020-10466-8
Chinese Library Classification (CLC)
TP [Automation technology; computer technology];
Discipline code
0812;
Abstract
Cross-modal image-text matching has attracted considerable interest in both the computer vision and natural language processing communities. The central issue in image-text matching is learning compact cross-modal representations and the correlation between image and text representations. However, the task poses two major challenges. First, current image representation methods focus on semantic information and disregard the spatial position relations between image regions. Second, most existing methods pay little attention to improving the textual representation, which plays a significant role in image-text matching. To address these issues, we designed a decipherable cross-modal multi-relationship aware reasoning network (CMRN) for image-text matching. In particular, a new method is proposed to extract multiple relationships and to learn the correlations between image regions, covering two kinds of visual relations: the geometric position relation and the semantic interaction. In addition, images are processed as graphs, and a novel spatial relation encoder performs reasoning on these graphs by employing a graph convolutional network (GCN) with an attention mechanism. Thereafter, a contextual text encoder based on Bidirectional Encoder Representations from Transformers (BERT) is adopted to learn distinctive textual representations. To verify the effectiveness of the proposed model, extensive experiments were conducted on two public datasets, MSCOCO and Flickr30K. The experimental results show that CMRN achieves superior performance compared with state-of-the-art methods. On Flickr30K, the proposed method outperforms state-of-the-art methods by more than 7.4% in text retrieval from an image query and by 5.0% relatively in image retrieval from a text query (based on Recall@1). On MSCOCO, the performance reaches 73.9% for text retrieval and 60.4% for image retrieval (based on Recall@1).
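The core reasoning step the abstract describes — graph-based reasoning over image regions with an attention mechanism — can be sketched roughly as a single graph-attention layer. This is a minimal illustrative sketch, not the paper's implementation: the function name, feature shapes, the LeakyReLU slope, and the tanh output activation are all assumptions.

```python
import numpy as np

def region_graph_attention(X, A, W, a, leaky=0.2):
    """One attention-weighted graph reasoning step over image-region features.

    X: (n, d) region features; A: (n, n) adjacency built from spatial/semantic
    relations (with self-loops); W: (d, d_out) projection; a: (2*d_out,)
    attention vector. All names and shapes are illustrative.
    """
    H = X @ W                                   # project region features
    n = H.shape[0]
    # pairwise attention logits e_ij = LeakyReLU(a^T [h_i || h_j])
    e = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            z = a @ np.concatenate([H[i], H[j]])
            e[i, j] = z if z > 0 else leaky * z
    e = np.where(A > 0, e, -1e9)                # mask non-neighbouring regions
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # row-wise softmax
    return np.tanh(alpha @ H)                   # relation-aware region features
```

Stacking such layers lets each region representation absorb information from its spatially and semantically related neighbours before matching against the text embedding.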
Pages: 12005-12027
Page count: 23
Related papers
50 records in total
  • [1] Cross-modal multi-relationship aware reasoning for image-text matching
    Jin Zhang
    Xiaohai He
    Linbo Qing
    Luping Liu
    Xiaodong Luo
    Multimedia Tools and Applications, 2022, 81 : 12005 - 12027
  • [2] Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval
    Wang, Sijin
    Wang, Ruiping
    Yao, Ziwei
    Shan, Shiguang
    Chen, Xilin
    2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 1497 - 1506
  • [3] Cross-modal information balance-aware reasoning network for image-text retrieval
    Qin, Xueyang
    Li, Lishuang
    Hao, Fei
    Pang, Guangyao
    Wang, Zehao
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 120
  • [4] Cross-modal alignment with graph reasoning for image-text retrieval
    Zheng Cui
    Yongli Hu
    Yanfeng Sun
    Junbin Gao
    Baocai Yin
    Multimedia Tools and Applications, 2022, 81 : 23615 - 23632
  • [5] Cross-modal alignment with graph reasoning for image-text retrieval
    Cui, Zheng
    Hu, Yongli
    Sun, Yanfeng
    Gao, Junbin
    Yin, Baocai
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (17) : 23615 - 23632
  • [6] Deep Cross-Modal Projection Learning for Image-Text Matching
    Zhang, Ying
    Lu, Huchuan
    COMPUTER VISION - ECCV 2018, PT I, 2018, 11205 : 707 - 723
  • [7] Cross-modal Semantically Augmented Network for Image-text Matching
    Yao, Tao
    Li, Yiru
    Li, Ying
    Zhu, Yingying
    Wang, Gang
    Yue, Jun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (04)
  • [8] Cross-Modal Attention With Semantic Consistence for Image-Text Matching
    Xu, Xing
    Wang, Tan
    Yang, Yang
    Zuo, Lin
    Shen, Fumin
    Shen, Heng Tao
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2020, 31 (12) : 5412 - 5425
  • [9] Cross-modal Semantic Interference Suppression for image-text matching
    Yao, Tao
    Peng, Shouyong
    Sun, Yujuan
    Sheng, Guorui
    Fu, Haiyan
    Kong, Xiangwei
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 133
  • [10] Cross-modal Graph Matching Network for Image-text Retrieval
    Cheng, Yuhao
    Zhu, Xiaoguang
    Qian, Jiuchao
    Wen, Fei
    Liu, Peilin
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2022, 18 (04)