Modality-Invariant Image-Text Embedding for Image-Sentence Matching

Cited by: 21
Authors
Liu, Ruoyu [1 ]
Zhao, Yao [1 ]
Wei, Shikui [1 ]
Zheng, Liang [2 ]
Yang, Yi [3 ]
Affiliations
[1] Beijing Jiaotong Univ, 3 Shuangyuancun, Beijing 100044, Peoples R China
[2] Australian Natl Univ, 115 North Rd, Acton, ACT 2601, Australia
[3] Univ Technol Sydney, 15 Broadway, Ultimo, NSW 2007, Australia
Keywords
Image-text embedding; adversarial learning; retrieval
DOI
10.1145/3300939
CLC Classification
TP [Automation and Computer Technology]
Discipline Code
0812
Abstract
Performing direct matching between different modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to propose a uniform model for matching images with all types of texts, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneous gap between visual and textual information, it can provide only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which can provide heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull the two distributions close by employing adversarial learning. However, the effectiveness of adversarial learning on image-sentence matching has not been proved, and no effective method yet exists. Inspired by previous works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by involving adversarial learning. On top of the triplet-loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding into either the image or the text modality. In addition, the multi-stage training procedure is carefully designed so that the proposed network not only imposes image-text similarity constraints via ground-truth labels, but also enforces the image and text embedding distributions to be similar via adversarial learning. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvements over the baseline model and that our results compare favorably to state-of-the-art methods.
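The abstract describes a triplet-loss baseline augmented with a modality classifier trained adversarially, so that image and text embeddings become indistinguishable in the shared space. As a rough illustration only, the sketch below (PyTorch; all layer sizes, feature dimensions, the 0.1 adversarial weight, and the simple alternating update are assumptions, not the authors' published configuration) shows how such an objective can be wired together:

```python
# Minimal sketch of a triplet-loss baseline plus an adversarial modality
# classifier, in the spirit of the abstract above. Dimensions and the
# training schedule are illustrative guesses, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Embedder(nn.Module):
    """Projects a modality-specific feature into the shared space."""
    def __init__(self, in_dim, emb_dim=1024):
        super().__init__()
        self.fc = nn.Linear(in_dim, emb_dim)

    def forward(self, x):
        return F.normalize(self.fc(x), dim=-1)  # L2-normalized embeddings

class ModalityClassifier(nn.Module):
    """Predicts whether an embedding came from the image or text branch."""
    def __init__(self, emb_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, e):
        return self.net(e)

def triplet_loss(im, tx, margin=0.2):
    """Bidirectional hinge ranking loss over in-batch negatives."""
    scores = im @ tx.t()                  # cosine similarity matrix
    pos = scores.diag().unsqueeze(1)      # matched-pair similarities
    cost_i2t = (margin + scores - pos).clamp(min=0)      # image -> text
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)  # text -> image
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_i2t[mask] = 0                    # exclude positives themselves
    cost_t2i[mask] = 0
    return cost_i2t.sum() + cost_t2i.sum()

# Assumed input dims: 2048-d CNN image features, 300-d text features.
img_enc, txt_enc = Embedder(2048), Embedder(300)
disc = ModalityClassifier()
opt_emb = torch.optim.Adam(
    list(img_enc.parameters()) + list(txt_enc.parameters()), lr=2e-4)
opt_disc = torch.optim.Adam(disc.parameters(), lr=2e-4)

def train_step(img_feat, txt_feat):
    im, tx = img_enc(img_feat), txt_enc(txt_feat)
    labels = torch.cat(
        [torch.zeros(len(im)), torch.ones(len(tx))]).long()

    # Stage A: train the classifier to tell the two modalities apart.
    d_loss = F.cross_entropy(disc(torch.cat([im, tx]).detach()), labels)
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # Stage B: train the embedders to match pairs AND fool the classifier
    # (adversarial term is the negated classification objective; stray
    # gradients into disc are cleared at the next Stage A).
    adv = -F.cross_entropy(disc(torch.cat([im, tx])), labels)
    loss = triplet_loss(im, tx) + 0.1 * adv  # 0.1 is an assumed weight
    opt_emb.zero_grad(); loss.backward(); opt_emb.step()

# Smoke test on random features:
train_step(torch.randn(32, 2048), torch.randn(32, 300))
```

Note that the paper describes a carefully designed multi-stage training procedure; the simple per-batch alternation above only conveys the interplay of the similarity and adversarial losses.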
Pages: 19
Related Papers
50 records in total
  • [31] A Survey on Deep Learning Based Image-Text Matching
    Liu, Meng
    Qi, Meng-Jin
    Zhan, Zhen-Yu
    Qu, Lei-Gang
    Nie, Xiu-Shan
    Nie, Li-Qiang
    [J]. Jisuanji Xuebao/Chinese Journal of Computers, 2023, 46 (11): : 2370 - 2399
  • [32] A NEIGHBOR-AWARE APPROACH FOR IMAGE-TEXT MATCHING
    Liu, Chunxiao
    Mao, Zhendong
    Zang, Wenyu
    Wang, Bin
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 3970 - 3974
  • [33] Similarity Contrastive Capsule Transformation for Image-Text Matching
    Zhang, Bin
    Sun, Ximin
    Li, Xiaoming
    Wang, Shuai
    Liu, Dan
    Jia, Jiangkai
    [J]. 2023 9TH INTERNATIONAL CONFERENCE ON MECHATRONICS AND ROBOTICS ENGINEERING, ICMRE, 2023, : 84 - 90
  • [34] Transformer Reasoning Network for Image-Text Matching and Retrieval
    Messina, Nicola
    Falchi, Fabrizio
    Esuli, Andrea
    Amato, Giuseppe
    [J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 5222 - 5229
  • [35] News Image-Text Matching With News Knowledge Graph
    Zhao Yumeng
    Yun Jing
    Gao Shuo
    Liu Limin
    [J]. IEEE ACCESS, 2021, 9 : 108017 - 108027
  • [36] Synthesizing Counterfactual Samples for Effective Image-Text Matching
    Wei, Hao
    Wang, Shuhui
    Han, Xinzhe
    Xue, Zhe
    Ma, Bin
    Wei, Xiaoming
    Wei, Xiaolin
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4355 - 4364
  • [37] Position Focused Attention Network for Image-Text Matching
    Wang, Yaxiong
    Yang, Hao
    Qian, Xueming
    Ma, Lin
    Lu, Jing
    Li, Biao
    Fan, Xin
    [J]. PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 3792 - 3798
  • [38] Plug-and-Play Regulators for Image-Text Matching
    Diao, Haiwen
    Zhang, Ying
    Liu, Wei
    Ruan, Xiang
    Lu, Huchuan
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 2322 - 2334
  • [39] Generative label fused network for image-text matching
    Zhao, Guoshuai
    Zhang, Chaofeng
    Shang, Heng
    Wang, Yaxiong
    Zhu, Li
    Qian, Xueming
    [J]. KNOWLEDGE-BASED SYSTEMS, 2023, 263
  • [40] PSYCHOPHYSIOLOGICAL STUDIES ON PARADIGM OF IMAGE-SENTENCE COMPARISON
    KLIX, F
    REBENTISCH, E
    [J]. ZEITSCHRIFT FUR PSYCHOLOGIE, 1976, 184 (03): : 445 - 449