Modality-Invariant Image-Text Embedding for Image-Sentence Matching

Times Cited: 21
Authors
Liu, Ruoyu [1]
Zhao, Yao [1]
Wei, Shikui [1]
Zheng, Liang [2]
Yang, Yi [3]
Affiliations
[1] Beijing Jiaotong University, 3 Shuangyuancun, Beijing 100044, China
[2] Australian National University, 115 North Rd, Acton, ACT 2601, Australia
[3] University of Technology Sydney, 15 Broadway, Ultimo, NSW 2007, Australia
Keywords
Image-text embedding; adversarial learning; retrieval
DOI
10.1145/3300939
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Performing direct matching between different modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to build a uniform model for matching images with all types of texts, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneity gap between visual and textual information, it provides only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which provides heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull the two distributions closer by employing adversarial learning. However, the effectiveness of adversarial learning for image-sentence matching has not yet been demonstrated, and an effective method is still lacking. Inspired by these works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by incorporating adversarial learning. On top of a triplet-loss baseline, we design a modality classification network with an adversarial loss, which classifies an embedding as belonging to either the image or the text modality. In addition, a multi-stage training procedure is carefully designed so that the proposed network not only imposes image-text similarity constraints via ground-truth labels, but also enforces the image and text embedding distributions to be similar through adversarial learning. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvements over the baseline model and that our results compare favorably with state-of-the-art methods.
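The abstract describes a two-player training scheme: a triplet ranking loss aligns matched image-sentence pairs in the shared space, while a modality classifier is trained adversarially against the two encoders so that their embedding distributions become indistinguishable. Below is a minimal PyTorch sketch of that scheme; the encoder and classifier architectures, feature dimensions, loss weight, and the alternating two-stage update are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 256

class Encoder(nn.Module):
    """Projects pre-extracted image or sentence features into the shared space."""
    def __init__(self, in_dim, emb_dim=EMB_DIM):
        super().__init__()
        self.fc = nn.Linear(in_dim, emb_dim)

    def forward(self, x):
        return F.normalize(self.fc(x), dim=-1)  # L2-normalized embeddings

class ModalityClassifier(nn.Module):
    """Predicts whether an embedding came from the image or the text branch."""
    def __init__(self, emb_dim=EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2))

    def forward(self, e):
        return self.net(e)

def triplet_loss(img, txt, margin=0.2):
    """Hinge-based ranking loss with in-batch hardest negatives (VSE++-style)."""
    sim = img @ txt.t()                        # pairwise cosine similarities
    pos = sim.diag().unsqueeze(1)              # similarities of matched pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    cost_t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_t.max(1)[0].mean() + cost_i.max(0)[0].mean()

# Toy batch of pre-extracted features (e.g., CNN image / sentence vectors).
img_feat, txt_feat = torch.randn(32, 2048), torch.randn(32, 300)
img_enc, txt_enc, clf = Encoder(2048), Encoder(300), ModalityClassifier()
opt_emb = torch.optim.Adam(
    list(img_enc.parameters()) + list(txt_enc.parameters()), lr=2e-4)
opt_clf = torch.optim.Adam(clf.parameters(), lr=2e-4)
img_lbl = torch.zeros(32, dtype=torch.long)    # class 0 = image
txt_lbl = torch.ones(32, dtype=torch.long)     # class 1 = text

for step in range(100):
    # Stage A: train the classifier to tell the two modalities apart.
    img_e, txt_e = img_enc(img_feat), txt_enc(txt_feat)
    clf_loss = (F.cross_entropy(clf(img_e.detach()), img_lbl) +
                F.cross_entropy(clf(txt_e.detach()), txt_lbl))
    opt_clf.zero_grad(); clf_loss.backward(); opt_clf.step()

    # Stage B: train the encoders to match pairs *and* fool the classifier,
    # pulling the image and text embedding distributions together.
    img_e, txt_e = img_enc(img_feat), txt_enc(txt_feat)
    adv_loss = (F.cross_entropy(clf(img_e), txt_lbl) +   # flipped labels
                F.cross_entropy(clf(txt_e), img_lbl))
    emb_loss = triplet_loss(img_e, txt_e) + 0.1 * adv_loss
    opt_emb.zero_grad(); opt_clf.zero_grad()
    emb_loss.backward(); opt_emb.step()
```

In this sketch the classifier step and the encoder step alternate; the flipped labels in Stage B reward the encoders for embeddings the classifier mislabels, which is one common way to realize the adversarial objective (a gradient-reversal layer is an equivalent alternative).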
Pages: 19