Cross-modal image retrieval with deep mutual information maximization

Cited: 8
Authors
Gu, Chunbin [1 ,3 ]
Bu, Jiajun [1 ,3 ,4 ]
Zhou, Xixi [1 ,3 ]
Yao, Chengwei [1 ,3 ]
Ma, Dongfang [2 ,5 ]
Yu, Zhi [1 ,3 ]
Yan, Xifeng [6 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, Zhejiang Prov Key Lab Serv Robot, Hangzhou 310007, Peoples R China
[2] Zhejiang Univ, Inst Marine Sensing & Networking, Hangzhou 310058, Peoples R China
[3] Alibaba Zhejiang Univ, Joint Inst Frontier Technol, Hangzhou 310007, Peoples R China
[4] MOE Key Lab Machine Percept, Beijing 100871, Peoples R China
[5] Zhejiang Univ, Key Lab Ocean Observat Imaging Testbed Zhejiang Pr, Zhoushan 316021, Peoples R China
[6] Univ Calif Santa Barbara, Dept Comp Sci, Santa Barbara, CA 93106 USA
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
Cross-modal Image Retrieval; Mutual Information; Deep Metric Learning; Self-supervised Learning; MODELS
DOI
10.1016/j.neucom.2022.01.078
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we study cross-modal image retrieval, where the input is a source image together with text describing modifications to that image, and the goal is to retrieve the desired target image. Prior work usually tackles this task in three stages: 1) extracting features from the inputs; 2) fusing the features of the source image and the modifying text into a single fusion feature; 3) learning a similarity metric between the desired image and the source image plus modified text via deep metric learning. Since classical image/text encoders already learn useful representations, and common pair-based loss functions from distance metric learning suffice for cross-modal retrieval, most methods improve retrieval accuracy by designing new fusion networks. However, these methods fail to handle the modality gap caused by the inconsistent feature distributions of different modalities, which strongly affects both feature fusion and similarity learning. To alleviate this problem, we apply the contrastive self-supervised learning method Deep InfoMax (DIM) [1] to bridge this gap by strengthening the dependence between the text, the image, and their fusion. Specifically, our method narrows the gap between the text and image modalities by maximizing mutual information between their semantically inconsistent representations. Moreover, we seek an effective common subspace for the semantically consistent features of the fusion and the desired image by applying Deep InfoMax between the low-level layers of the image encoder and the high-level layer of the fusion network. Extensive experiments on three large-scale benchmarks show that our approach bridges the modality gap and achieves state-of-the-art retrieval performance. (c) 2022 Elsevier B.V. All rights reserved.
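The core technique named in the abstract, Deep InfoMax, maximizes a lower bound on the mutual information between two representations using a trainable discriminator that scores aligned feature pairs above shuffled ones. As a rough illustration only (not the authors' released code; the names MIDiscriminator and jsd_mi_loss, the 512-dimensional features, and the hidden width are all hypothetical), the PyTorch sketch below estimates the Jensen-Shannon MI bound between a fused image-plus-text feature and a target-image feature:

# Hypothetical sketch of a Deep InfoMax-style Jensen-Shannon MI objective;
# not the paper's code. Assumes PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MIDiscriminator(nn.Module):
    """Scores (x, y) feature pairs; jointly drawn pairs should score higher."""
    def __init__(self, x_dim, y_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def jsd_mi_loss(disc, x, y):
    """Negative Jensen-Shannon lower bound on I(x; y).

    Positives are aligned rows of x and y (joint distribution); negatives
    pair each x with a shuffled y (approximate product of marginals).
    """
    y_neg = y[torch.randperm(y.size(0))]  # break the alignment
    pos = disc(x, y)
    neg = disc(x, y_neg)
    # Maximizing E_joint[-softplus(-T)] - E_marg[softplus(T)] tightens the
    # bound; we return its negation so it can be minimized with SGD.
    return F.softplus(-pos).mean() + F.softplus(neg).mean()

# Hypothetical usage: pull the fusion feature toward the target-image feature.
fusion_feat = torch.randn(32, 512)   # stand-in for the fusion-network output
target_feat = torch.randn(32, 512)   # stand-in for the image-encoder output
disc = MIDiscriminator(512, 512)
loss = jsd_mi_loss(disc, fusion_feat, target_feat)
loss.backward()

In-batch shuffling is the usual way to draw negatives here, since it approximates sampling from the product of the two feature marginals; the same loss could equally be placed between the text and image features to narrow the modality gap the abstract describes.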
Pages: 166-177 (12 pages)
Related Papers (50 total)
  • [1] Guo, Weikuo; Huang, Huaibo; Kong, Xiangwei; He, Ran. Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation. Proceedings of the 27th ACM International Conference on Multimedia (MM'19), 2019: 1712-1720.
  • [2] Hoang, Tuan; Do, Thanh-Toan; Nguyen, Tam V.; Cheung, Ngai-Man. Multimodal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing. IEEE Transactions on Neural Networks and Learning Systems, 2023, 34(9): 6289-6302.
  • [3] Mao, Yiqiao; Yan, Xiaoqiang; Guo, Qiang; Ye, Yangdong. Deep Mutual Information Maximin for Cross-Modal Clustering. Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), 2021, 35: 8893-8901.
  • [4] Qin, Qibing; Huo, Yadong; Huang, Lei; Dai, Jiangyan; Zhang, Huihui; Zhang, Wenfeng. Deep Neighborhood-Preserving Hashing With Quadratic Spherical Mutual Information for Cross-Modal Retrieval. IEEE Transactions on Multimedia, 2024, 26: 6361-6374.
  • [5] Zhang, Xudong; Zhao, Wenfeng. Deep Normalization Cross-Modal Retrieval for Trajectory and Image Matching. Database Systems for Advanced Applications: DASFAA 2023 International Workshops, 2023, 13922: 181-193.
  • [6] Guo, Mao; Yuan, Yuan; Lu, Xiaoqiang. Deep Cross-Modal Retrieval for Remote Sensing Image and Audio. 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS), 2018.
  • [7] Zhen, Liangli; Hu, Peng; Wang, Xu; Peng, Dezhong. Deep Supervised Cross-modal Retrieval. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 10386-10395.
  • [8] Qian, Xinyuan; Xue, Wei; Zhang, Qiquan; Tao, Ruijie; Li, Haizhou. Deep Cross-Modal Retrieval Between Spatial Image and Acoustic Speech. IEEE Transactions on Multimedia, 2024, 26: 4480-4489.
  • [9] Guo, Jiaen; Guan, Xin. Deep Adversarial Cascaded Hashing for Cross-Modal Vessel Image Retrieval. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023, 16: 2205-2220.
  • [10] Wei, Yuqi; Li, Ning. Cross-Modal Information Interaction Reasoning Network for Image and Text Retrieval. Computer Engineering and Applications, 2023, 59(16): 115-124.