Modality-Invariant Image-Text Embedding for Image-Sentence Matching

Cited by: 21
Authors
Liu, Ruoyu [1 ]
Zhao, Yao [1 ]
Wei, Shikui [1 ]
Zheng, Liang [2 ]
Yang, Yi [3 ]
Affiliations
[1] Beijing Jiaotong Univ, 3 Shuangyuancun, Beijing 100044, Peoples R China
[2] Australian Natl Univ, 115 North Rd, Acton, ACT 2601, Australia
[3] Univ Technol Sydney, 15 Broadway, Ultimo, NSW 2007, Australia
Keywords
Image-text embedding; adversarial learning; retrieval
DOI
10.1145/3300939
Chinese Library Classification
TP [Automation Technology; Computer Technology]
Discipline Classification Code
0812
Abstract
Performing direct matching between different modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to propose a uniform model for matching images with all types of texts, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneous gap between visual and textual information, it can provide only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which can provide heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull the two distributions close by employing adversarial learning. However, the effectiveness of adversarial learning on image-sentence matching has not been demonstrated, and an effective method is still lacking. Inspired by these works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by introducing adversarial learning. On top of a triplet-loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding as belonging to either the image or the text modality. In addition, a multi-stage training procedure is carefully designed so that the proposed network not only imposes image-text similarity constraints through ground-truth labels, but also enforces the image and text embedding distributions to be similar through adversarial learning.
Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvement over the baseline model and that our results compare favorably to the state-of-the-art methods.
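As a rough illustration of the two objectives the abstract combines, the sketch below (hypothetical function names; not the authors' implementation) computes a bidirectional triplet ranking loss over a batch of paired image/text embeddings, plus the binary cross-entropy that a two-way modality classifier would minimize and that the embedding networks, trained adversarially, would maximize:

```python
import numpy as np

def l2norm(x):
    """Row-wise L2 normalization, so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def triplet_loss(img, txt, margin=0.2):
    """Bidirectional triplet ranking loss; row i of img pairs with row i of txt."""
    s = l2norm(img) @ l2norm(txt).T        # cosine similarity matrix
    pos = np.diag(s)                       # similarities of matched pairs
    # hinge costs for image->text and text->image ranking violations
    cost_i2t = np.maximum(0.0, margin + s - pos[:, None])
    cost_t2i = np.maximum(0.0, margin + s - pos[None, :])
    np.fill_diagonal(cost_i2t, 0.0)        # matched pairs incur no cost
    np.fill_diagonal(cost_t2i, 0.0)
    return cost_i2t.sum() + cost_t2i.sum()

def adversarial_loss(logits, modality):
    """Binary cross-entropy of a modality classifier (1 = image, 0 = text).
    The classifier minimizes this; the embedding networks are updated to
    maximize it (e.g., via gradient reversal), pushing the two embedding
    distributions toward being indistinguishable."""
    p = 1.0 / (1.0 + np.exp(-logits))      # sigmoid
    return -np.mean(modality * np.log(p) + (1.0 - modality) * np.log(1.0 - p))
```

For instance, identical image and text embeddings yield zero triplet loss, and a classifier that outputs zero logits (a 50/50 guess on every sample) yields a cross-entropy of ln 2, the value at which the classifier can no longer tell the modalities apart.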
Pages: 19
Related Papers
50 items
  • [41] JECL: Joint Embedding and Cluster Learning for Image-Text Pairs
    Yang, Sean T.
    Huang, Kuan-Hao
    Howe, Bill
    [J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 8344 - 8351
  • [42] Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching
    Biten, Ali Furkan
    Mafla, Andres
    Gomez, Lluis
    Karatzas, Dimosthenis
    [J]. 2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022, : 2483 - 2492
  • [43] Webly Supervised Image-Text Embedding with Noisy Tag Refinement
    Mithun, Niluthpol C.
    Pasricha, Ravdeep
    Papalexakis, Evangelos
    Roy-Chowdhury, Amit K.
    [J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 7454 - 7461
  • [44] Action-Aware Embedding Enhancement for Image-Text Retrieval
    Li, Jiangtong
    Niu, Li
    Zhang, Liqing
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1323 - 1331
  • [45] Cross-Modality Image Matching Network With Modality-Invariant Feature Representation for Airborne-Ground Thermal Infrared and Visible Datasets
    Cui, Song
    Ma, Ailong
    Wan, Yuting
    Zhong, Yanfei
    Luo, Bin
    Xu, Miaozhong
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [46] Estimating the Semantics via Sector Embedding for Image-Text Retrieval
    Wang, Zheng
    Gao, Zhenwei
    Han, Mengqun
    Yang, Yang
    Shen, Heng Tao
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 10342 - 10353
  • [48] Image-Text Interaction
    Strothotte, Thomas
    [J]. 2007 INTERNATIONAL CONFERENCE ON INTELLIGENT USER INTERFACES, 2007, : 3 - 3
  • [49] Text-image communication, image-text communication
    Münkner, J
    [J]. ZEITSCHRIFT FUR GERMANISTIK, 2004, 14 (02): : 454 - 455
  • [50] Image-text interaction graph neural network for image-text sentiment analysis
    Liao, Wenxiong
    Zeng, Bi
    Liu, Jianqi
    Wei, Pengfei
    Fang, Jiongkun
    [J]. Applied Intelligence, 2022, 52 : 11184 - 11198