Multi-Granularity Interactive Transformer Hashing for Cross-modal Retrieval

Cited by: 12
Authors
Liu, Yishu [1 ]
Wu, Qingpeng [1 ]
Zhang, Zheng [1 ,2 ]
Zhang, Jingyi [3 ]
Lu, Guangming [1 ]
Affiliations
[1] Harbin Inst Technol, Shenzhen, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Huawei Technol Co Ltd, Hangzhou, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
National Natural Science Foundation of China
Keywords
Cross-modal hashing; cross-modal retrieval; multi-granularity; transformer; knowledge distillation; contrastive learning;
DOI
10.1145/3581783.3612411
CLC Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Owing to their powerful representation ability and high efficiency, deep cross-modal hashing (DCMH) methods have emerged as a fast similarity search technique. Prior studies primarily focus on exploring pairwise similarities across modalities, but fail to comprehensively capture the multi-grained semantic correlations during intra- and inter-modal interaction. To tackle this issue, this paper proposes a novel Multi-granularity Interactive Transformer Hashing (MITH) network, which hierarchically considers both coarse- and fine-grained similarity measurements across modalities in one unified transformer-based framework. To the best of our knowledge, this is the first attempt at multi-granularity transformer-based cross-modal hashing. Specifically, a well-designed distilled intra-modal interaction module is deployed to excavate modality-specific concept knowledge via global-local knowledge distillation under the guidance of implicit conceptual category-level representations. Moreover, we construct a contrastive inter-modal alignment module that mines modality-independent semantic concept correspondences with instance- and token-wise contrastive learning. Such a collaborative learning paradigm jointly alleviates the heterogeneity and semantic gaps among modalities from a multi-granularity perspective, yielding discriminative modality-invariant hash codes. Extensive experiments on multiple representative cross-modal datasets demonstrate the consistent superiority of MITH over existing state-of-the-art baselines. The code is available at https://github.com/DarrenZZhang/MITH.
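To make the abstract's instance-wise contrastive alignment idea concrete, below is a minimal PyTorch sketch: paired image/text embeddings are projected to continuous hash representations, aligned with a symmetric InfoNCE-style loss, and binarized with sign() for retrieval. This is an illustrative assumption, not the MITH implementation (see the repository linked above for the authors' code); the `HashHead` module, the 512-d/64-bit dimensions, and the loss formulation are hypothetical choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashHead(nn.Module):
    """Hypothetical head: projects a modality embedding to K continuous hash bits."""
    def __init__(self, dim: int, n_bits: int):
        super().__init__()
        self.proj = nn.Linear(dim, n_bits)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh keeps outputs in (-1, 1), easing later binarization with sign()
        return torch.tanh(self.proj(x))

def instance_contrastive_loss(h_img, h_txt, temperature: float = 0.07):
    """Symmetric InfoNCE: each image should match its paired text, and vice versa."""
    z_i = F.normalize(h_img, dim=-1)
    z_t = F.normalize(h_txt, dim=-1)
    logits = z_i @ z_t.t() / temperature                     # (B, B) similarities
    targets = torch.arange(z_i.size(0), device=z_i.device)   # diagonal = positive pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-ins for transformer [CLS] embeddings
img_head, txt_head = HashHead(512, 64), HashHead(512, 64)
img_emb, txt_emb = torch.randn(8, 512), torch.randn(8, 512)
loss = instance_contrastive_loss(img_head(img_emb), txt_head(txt_emb))
codes = torch.sign(img_head(img_emb))  # +/-1 hash codes used at retrieval time
```

Token-wise contrastive learning, the finer granularity mentioned in the abstract, would apply the same matching principle to patch/word token embeddings rather than instance-level ones.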
Pages: 893-902
Page count: 10
Related Papers
50 records in total
  • [1] Multi-Granularity Semantic Information Integration Graph for Cross-Modal Hash Retrieval
    Han, Zhichao
    Bin Azman, Azreen
    Khalid, Fatimah Binti
    Mustaffa, Mas Rina Binti
    IEEE ACCESS, 2024, 12 : 44682 - 44694
  • [2] Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers
    Wang, Jinpeng
    Zeng, Ziyun
    Chen, Bin
    Wang, Yuting
    Liao, Dongliang
    Li, Gongfu
    Wang, Yiru
    Xia, Shu-Tao
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (08) : 2765 - 2797
  • [3] Unsupervised Multi-modal Hashing for Cross-Modal Retrieval
    Yu, Jun
    Wu, Xiao-Jun
    Zhang, Donglin
    COGNITIVE COMPUTATION, 2022, 14 (03) : 1159 - 1171
  • [4] Semantic-alignment transformer and adversary hashing for cross-modal retrieval
    Sun, Yajun
    Wang, Meng
    Ma, Ying
    APPLIED INTELLIGENCE, 2024, 54 (17-18) : 7581 - 7602
  • [5] Discriminant Adversarial Hashing Transformer for Cross-modal Vessel Image Retrieval
    Guan X.
    Guo J.
    Lu Y.
Dianzi Yu Xinxi Xuebao/Journal of Electronics and Information Technology, 2023, 45 (12): 4411 - 4420
  • [6] Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
    Shen, Xiaobo
    Huang, Qianxin
    Lan, Long
    Zheng, Yuhui
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 1227 - 1235
  • [7] Hashing for Cross-Modal Similarity Retrieval
    Liu, Yao
    Yuan, Yanhong
    Huang, Qiaoli
    Huang, Zhixing
    2015 11TH INTERNATIONAL CONFERENCE ON SEMANTICS, KNOWLEDGE AND GRIDS (SKG), 2015, : 1 - 8
  • [8] Visual Question Generation Under Multi-granularity Cross-Modal Interaction
    Chai, Zi
    Wan, Xiaojun
    Han, Soyeon Caren
    Poon, Josiah
    MULTIMEDIA MODELING, MMM 2023, PT I, 2023, 13833 : 255 - 266
  • [9] ROBUST MULTI-VIEW HASHING FOR CROSS-MODAL RETRIEVAL
    Wang, Haitao
    Chen, Hui
    Meng, Min
    Wu, JiGang
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1012 - 1017