Modality-Invariant Image-Text Embedding for Image-Sentence Matching

Cited by: 21
Authors
Liu, Ruoyu [1 ]
Zhao, Yao [1 ]
Wei, Shikui [1 ]
Zheng, Liang [2 ]
Yang, Yi [3 ]
Affiliations
[1] Beijing Jiaotong Univ, 3 Shuangyuancun, Beijing 100044, Peoples R China
[2] Australian Natl Univ, 115 North Rd, Acton, ACT 2601, Australia
[3] Univ Technol Sydney, 15 Broadway, Ultimo, NSW 2007, Australia
Keywords
Image-text embedding; adversarial learning; retrieval
DOI
10.1145/3300939
CLC Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Performing direct matching between different modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to build a uniform model for matching images with all types of texts, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneous gap between visual and textual information, it can provide only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which provides heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull the two distributions together by employing adversarial learning. However, the effectiveness of adversarial learning on image-sentence matching has not been demonstrated, and an effective method is still lacking. Inspired by these works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by incorporating adversarial learning. On top of the triplet loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding into either the image or the text modality. In addition, a multi-stage training procedure is carefully designed so that the proposed network not only imposes image-text similarity constraints from ground-truth labels, but also enforces similarity between the image and text embedding distributions through adversarial learning.
Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvement over the baseline model and that our results compare favorably to the state-of-the-art methods.
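The two objectives described in the abstract, a triplet ranking loss over matched/mismatched image-sentence pairs and an adversarial modality-classification loss, can be illustrated with a minimal toy sketch. This is not the authors' implementation: the margin value, the cosine-similarity scoring, and the single-layer logistic modality classifier are all assumptions chosen only to make the alternating objectives concrete.

```python
import math

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def triplet_loss(img, txt_pos, txt_neg, margin=0.2):
    # Hinge-based ranking loss: the matched image-sentence pair should
    # score at least `margin` higher than a mismatched one.
    # (The margin value here is an assumption, not taken from the paper.)
    return max(0.0, margin - cosine(img, txt_pos) + cosine(img, txt_neg))

def modality_losses(emb, w, is_image):
    # Binary logistic "modality classifier" (hypothetical single-layer
    # stand-in) that tries to tell whether an embedding came from the
    # image or the text branch. Adversarial training alternates:
    # the classifier minimizes d_loss, while the embedding networks
    # minimize g_loss (i.e. try to fool the classifier), which pushes
    # the two modality distributions toward each other.
    logit = sum(wi * ei for wi, ei in zip(w, emb))
    p_image = 1.0 / (1.0 + math.exp(-logit))
    eps = 1e-12
    d_loss = -math.log((p_image if is_image else 1.0 - p_image) + eps)
    g_loss = -math.log((1.0 - p_image if is_image else p_image) + eps)
    return d_loss, g_loss

# toy 2-D embeddings
img = [1.0, 0.0]
txt_match = [0.9, 0.1]   # close to the image embedding
txt_wrong = [0.0, 1.0]   # orthogonal mismatch

rank = triplet_loss(img, txt_match, txt_wrong)
d, g = modality_losses(img, [0.0, 0.0], is_image=True)  # untrained classifier
print(round(rank, 4), round(d, 4), round(g, 4))
```

With an untrained (zero-weight) classifier, p_image is 0.5, so the classifier loss and the adversarial embedding loss coincide at -log(0.5); training then alternates between sharpening the classifier and updating the embeddings to blur it, alongside the ranking constraint.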
Pages: 19
Related Papers (50 total)
  • [1] Hybrid Joint Embedding with Intra-Modality Loss for Image-Text Matching
    Ebaid, Doaa B.
    El-Zoghabi, Adel A.
    Madbouly, Magda M.
    [J]. 2022 9TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING & MACHINE INTELLIGENCE, ISCMI, 2022, : 178 - 182
  • [2] Learning hierarchical embedding space for image-text matching
    Sun, Hao
    Qin, Xiaolin
    Liu, Xiaojing
    [J]. INTELLIGENT DATA ANALYSIS, 2024, 28 (03) : 647 - 665
  • [3] Location Attention Knowledge Embedding Model for Image-Text Matching
    Xu, Guoqing
    Hu, Min
    Wang, Xiaohua
    Yang, Jiaoyun
    Li, Nan
    Zhang, Qingyu
    [J]. PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT I, 2024, 14425 : 408 - 421
  • [4] Dynamic Pruning of Regions for Image-Sentence Matching
    Wu, Jie
    Liu, Weifeng
    Wang, Leiquan
    Shen, Xiuxuan
    Wei, Yiwei
    Wu, Chunlei
    [J]. SIGNAL PROCESSING-IMAGE COMMUNICATION, 2023, 117
  • [5] Modality-Invariant Structural Feature Representation for Multimodal Remote Sensing Image Matching
    Fan, Jianwei
    Xiong, Qing
    Li, Jian
    Ye, Yuanxin
    [J]. IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2023, 20
  • [6] Conditional Image-Text Embedding Networks
    Plummer, Bryan A.
    Kordas, Paige
    Kiapour, M. Hadi
    Zheng, Shuai
    Piramuthu, Robinson
    Lazebnik, Svetlana
    [J]. COMPUTER VISION - ECCV 2018, PT XII, 2018, 11216 : 258 - 274
  • [7] CycleMatch: A cycle-consistent embedding network for image-text matching
    Liu, Yu
    Guo, Yanming
    Liu, Li
    Bakker, Erwin M.
    Lew, Michael S.
    [J]. PATTERN RECOGNITION, 2019, 93 : 365 - 379
  • [8] Regularizing Visual Semantic Embedding With Contrastive Learning for Image-Text Matching
    Liu, Yang
    Liu, Hong
    Wang, Huaqiu
    Liu, Mengyuan
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 1332 - 1336
  • [9] Modality-Invariant Image Classification Based on Modality Uniqueness and Dictionary Learning
    Kim, Seungryong
    Cai, Rui
    Park, Kihong
    Kim, Sunok
    Sohn, Kwanghoon
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2017, 26 (02) : 884 - 899
  • [10] Saliency-Guided Attention Network for Image-Sentence Matching
    Ji, Zhong
    Wang, Haoran
    Han, Jungong
    Pang, Yanwei
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 5753 - 5762