Latent Space Semantic Supervision Based on Knowledge Distillation for Cross-Modal Retrieval

Cited by: 5
Authors
Zhang, Li [1]
Wu, Xiangqian [1]
Affiliation
[1] Harbin Inst Technol, Fac Comp, Harbin, Peoples R China
Keywords
Cross-modal retrieval; image-text matching; latent space supervision; knowledge distillation
DOI
10.1109/TIP.2022.3220051
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
As an important field in information retrieval, fine-grained cross-modal retrieval has received considerable attention from researchers. Existing fine-grained cross-modal retrieval methods have made several improvements in capturing the fine-grained interplay between vision and language, but they do not consider the fine-grained correspondences between features in the image latent space and in the text latent space, respectively, which may lead to inaccurate inference of intra-modal relations or false alignment of cross-modal information. Considering that object detection can provide fine-grained correspondences between image region features and their semantic features, this paper proposes a novel latent space semantic supervision model based on knowledge distillation (L3S-KD). The model trains classifiers supervised by the fine-grained correspondences obtained from an object detection model, using knowledge distillation, for fine-grained alignment of the image latent space, and by the labels of objects and attributes for fine-grained alignment of the text latent space. Compared with existing fine-grained correspondence matching methods, L3S-KD can learn more accurate semantic similarities for local fragments in image-text pairs. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate that the L3S-KD model consistently outperforms state-of-the-art methods for image-text matching.
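The abstract does not give the loss used for the image-side supervision. As a rough illustration only, the sketch below shows a generic temperature-scaled knowledge distillation loss, in which a classifier on latent region features (the student) is trained to match the class distribution produced by an object detector (the teacher). The function names, the temperature value, and the KL-divergence form are assumptions made for this example and are not taken from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: this is the standard distillation form, not necessarily
    the exact objective used by L3S-KD.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (detector)
    log_q = log_softmax(student_logits, temperature)  # student (latent classifier)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


def region_supervision_loss(student_region_logits, teacher_region_logits,
                            temperature=2.0):
    """Average the distillation loss over all regions of one image."""
    losses = [
        distillation_loss(s, t, temperature)
        for s, t in zip(student_region_logits, teacher_region_logits)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical logits for two image regions over three object classes.
    teacher = [[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]]
    student = [[2.0, 1.5, 0.5], [0.3, 1.0, 0.9]]
    print(region_supervision_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, which is the property that lets detector outputs act as soft labels for the latent-space classifier.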
Pages: 7154-7164
Page count: 11