Modality-Invariant Image-Text Embedding for Image-Sentence Matching

Times Cited: 21
Authors
Liu, Ruoyu [1]
Zhao, Yao [1]
Wei, Shikui [1]
Zheng, Liang [2]
Yang, Yi [3]
Affiliations
[1] Beijing Jiaotong University, 3 Shuangyuancun, Beijing 100044, China
[2] Australian National University, 115 North Rd, Acton, ACT 2601, Australia
[3] University of Technology Sydney, 15 Broadway, Ultimo, NSW 2007, Australia
Keywords
Image-text embedding; adversarial learning; retrieval
DOI
10.1145/3300939
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Performing direct matching between different modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to build a uniform model for matching images with all types of texts, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneity gap between visual and textual information, it provides only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which provides heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull the two distributions closer by employing adversarial learning. However, the effectiveness of adversarial learning for image-sentence matching has not yet been demonstrated, and an effective method is still lacking. Inspired by these works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by incorporating adversarial learning. On top of a triplet-loss baseline, we design a modality classification network with an adversarial loss, which classifies an embedding as belonging to either the image or the text modality. In addition, a multi-stage training procedure is carefully designed so that the proposed network not only imposes image-text similarity constraints via ground-truth labels, but also enforces the image and text embedding distributions to be similar through adversarial learning. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvements over the baseline model and that our results compare favorably with state-of-the-art methods.
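The abstract describes a two-player training scheme: a triplet ranking loss aligns matched image-sentence pairs in the shared space, while a modality classifier is trained adversarially against the two encoders so that their embedding distributions become indistinguishable. Below is a minimal PyTorch sketch of that scheme; the encoder and classifier architectures, feature dimensions, loss weight, and the alternating two-stage update are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 256

class Encoder(nn.Module):
    """Projects pre-extracted image or sentence features into the shared space."""
    def __init__(self, in_dim, emb_dim=EMB_DIM):
        super().__init__()
        self.fc = nn.Linear(in_dim, emb_dim)

    def forward(self, x):
        return F.normalize(self.fc(x), dim=-1)  # L2-normalized embeddings

class ModalityClassifier(nn.Module):
    """Predicts whether an embedding came from the image or the text branch."""
    def __init__(self, emb_dim=EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2))

    def forward(self, e):
        return self.net(e)

def triplet_loss(img, txt, margin=0.2):
    """Hinge-based ranking loss with in-batch hardest negatives (VSE++-style)."""
    sim = img @ txt.t()                        # pairwise cosine similarities
    pos = sim.diag().unsqueeze(1)              # similarities of matched pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    cost_t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_t.max(1)[0].mean() + cost_i.max(0)[0].mean()

# Toy batch of pre-extracted features (e.g., CNN image / sentence vectors).
img_feat, txt_feat = torch.randn(32, 2048), torch.randn(32, 300)
img_enc, txt_enc, clf = Encoder(2048), Encoder(300), ModalityClassifier()
opt_emb = torch.optim.Adam(
    list(img_enc.parameters()) + list(txt_enc.parameters()), lr=2e-4)
opt_clf = torch.optim.Adam(clf.parameters(), lr=2e-4)
img_lbl = torch.zeros(32, dtype=torch.long)    # class 0 = image
txt_lbl = torch.ones(32, dtype=torch.long)     # class 1 = text

for step in range(100):
    # Stage A: train the classifier to tell the two modalities apart.
    img_e, txt_e = img_enc(img_feat), txt_enc(txt_feat)
    clf_loss = (F.cross_entropy(clf(img_e.detach()), img_lbl) +
                F.cross_entropy(clf(txt_e.detach()), txt_lbl))
    opt_clf.zero_grad(); clf_loss.backward(); opt_clf.step()

    # Stage B: train the encoders to match pairs *and* fool the classifier,
    # pulling the image and text embedding distributions together.
    img_e, txt_e = img_enc(img_feat), txt_enc(txt_feat)
    adv_loss = (F.cross_entropy(clf(img_e), txt_lbl) +   # flipped labels
                F.cross_entropy(clf(txt_e), img_lbl))
    emb_loss = triplet_loss(img_e, txt_e) + 0.1 * adv_loss
    opt_emb.zero_grad(); opt_clf.zero_grad()
    emb_loss.backward(); opt_emb.step()
```

In this sketch the classifier step and the encoder step alternate; the flipped labels in Stage B reward the encoders for embeddings the classifier mislabels, which is one common way to realize the adversarial objective (a gradient-reversal layer is an equivalent alternative).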
Pages: 19