SMAN: Stacked Multimodal Attention Network for Cross-Modal Image-Text Retrieval

Cited by: 30
Authors
Ji, Zhong [1 ]
Wang, Haoran [1 ]
Han, Jungong [2 ]
Pang, Yanwei [1 ]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[2] Univ Warwick, Data Sci Grp, Coventry CV4 7AL, W Midlands, England
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Semantics; Feature extraction; Correlation; Task analysis; Extraterrestrial measurements; Deep learning; Attention mechanism; cross-modal retrieval (CMR); multimodal learning; vision and language;
DOI
10.1109/TCYB.2020.2985716
CLC number
TP [Automation Technology, Computer Technology];
Subject classification number
0812;
Abstract
This article tackles cross-modal image-text retrieval, an interdisciplinary topic spanning the computer vision and natural language processing communities. Existing global representation alignment-based methods fail to pinpoint the semantically meaningful portions of images and texts, while local representation alignment schemes suffer from the heavy computational burden of exhaustively aggregating similarities between visual fragments and textual words. In this article, we propose a stacked multimodal attention network (SMAN) that exploits the fine-grained interdependencies between image and text through a stacked multimodal attention mechanism, mapping the aggregation of attentive fragments into a common space in which cross-modal similarity is measured. Specifically, we sequentially employ intramodal information and multimodal information as guidance to perform multiple steps of attention reasoning, so that the fine-grained correlation between image and text can be modeled. As a consequence, we are able to discover the semantically meaningful visual regions or sentence words that contribute to measuring cross-modal similarity more precisely. Moreover, we present a novel bidirectional ranking loss that pulls matched multimodal instances closer together, allowing us to make full use of pairwise supervision to preserve the manifold structure of heterogeneous pairwise data. Extensive experiments on two benchmark datasets demonstrate that SMAN consistently yields competitive performance compared with state-of-the-art methods.
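The abstract names two technical components: multi-step (stacked) attention reasoning over local fragments, and a bidirectional ranking loss over matched image-text pairs. The sketch below illustrates both ideas in generic PyTorch. It is a minimal sketch of the general techniques, not the authors' implementation; the names (AttentionStep, bidirectional_ranking_loss), the additive fusion used for multimodal guidance, and the margin default are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionStep(nn.Module):
    """One attention-reasoning step: a guidance vector attends over a set of
    local fragments (image regions or word features), and the attended summary
    refines the guidance, so stacked steps progressively sharpen the focus."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, guidance, fragments):
        # fragments: (batch, n, dim); guidance: (batch, dim)
        scores = torch.bmm(fragments, self.proj(guidance).unsqueeze(2)).squeeze(2)
        weights = F.softmax(scores, dim=1)                    # attention over fragments
        attended = (weights.unsqueeze(2) * fragments).sum(1)  # weighted fragment summary
        return guidance + attended                            # refined guidance

def bidirectional_ranking_loss(sim, margin=0.2):
    """Hinge-based triplet ranking loss applied in both retrieval directions.
    sim is a (batch, batch) similarity matrix whose diagonal holds matched
    image-text pairs; off-diagonal entries are mismatched pairs."""
    pos = sim.diag().view(-1, 1)
    cost_i2t = (margin + sim - pos).clamp(min=0)      # image -> text violations
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)  # text -> image violations
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_i2t.masked_fill(mask, 0).sum() + cost_t2i.masked_fill(mask, 0).sum()

# Toy usage: step 1 is guided by intramodal information (a global pooled
# feature of the image itself); step 2 by multimodal information (here a
# simple additive fusion with the sentence embedding, an assumption made
# for illustration only).
step1, step2 = AttentionStep(256), AttentionStep(256)
regions = torch.randn(8, 36, 256)               # e.g., 36 region features per image
sentences = torch.randn(8, 256)                 # sentence embeddings for the batch
guidance = step1(regions.mean(dim=1), regions)  # intramodal guidance
image_vec = step2(guidance + sentences, regions)  # multimodal guidance
sim = F.normalize(image_vec, dim=1) @ F.normalize(sentences, dim=1).t()
loss = bidirectional_ranking_loss(sim)
```

Both directions of the loss matter because the benchmark reports image-to-text and text-to-image retrieval separately; summing the two hinge terms penalizes a mismatched candidate that outranks the true pair in either direction.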
Pages: 1086-1097
Page count: 12
Related papers
50 items in total
  • [22] Cross-modal Semantically Augmented Network for Image-text Matching
    Yao, Tao
    Li, Yiru
    Li, Ying
    Zhu, Yingying
    Wang, Gang
    Yue, Jun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (04)
  • [23] Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval
    Lu, Haoyu
    Huo, Yuqi
    Ding, Mingyu
    Fei, Nanyi
    Lu, Zhiwu
    MACHINE INTELLIGENCE RESEARCH, 2023, 20 (04) : 569 - 582
  • [24] Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval
    Mithun, Niluthpol Chowdhury
    Panda, Rameswar
    Papalexakis, Evangelos E.
    Roy-Chowdhury, Amit K.
    PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, : 1856 - 1864
  • [25] RICH: A rapid method for image-text cross-modal hash retrieval
    Li, Bo
    Yao, Dan
    Li, Zhixin
    DISPLAYS, 2023, 79
  • [26] SAM: cross-modal semantic alignments module for image-text retrieval
    Park, Pilseo
    Jang, Soojin
    Cho, Yunsung
    Kim, Youngbin
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (04) : 12363 - 12377
  • [27] An Enhanced Feature Extraction Framework for Cross-Modal Image-Text Retrieval
    Zhang, Jinzhi
    Wang, Luyao
    Zheng, Fuzhong
    Wang, Xu
    Zhang, Haisu
    REMOTE SENSING, 2024, 16 (12)
  • [28] Learning Hierarchical Semantic Correspondences for Cross-Modal Image-Text Retrieval
    Zeng, Sheng
    Liu, Changhong
    Zhou, Jun
    Chen, Yong
    Jiang, Aiwen
    Li, Hanxi
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 239 - 248
  • [29] MGAN: Attempting a Multimodal Graph Attention Network for Remote Sensing Cross-Modal Text-Image Retrieval
    Wang, Zhiming
    Dong, Zhihua
    Yang, Xiaoyu
    Wang, Zhiguo
    Yin, Guangqiang
    PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND NETWORKS, VOL II, CENET 2023, 2024, 1126 : 261 - 273
  • [30] Hierarchical modal interaction balance cross-modal hashing for unsupervised image-text retrieval
    Zhang, J.
    Lin, Z.
    Jiang, X.
    Li, M.
    Wang, C.
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (42) : 90487 - 90509