Multi-level Alignment Network for Domain Adaptive Cross-modal Retrieval

Cited by: 15
Authors
Dong, Jianfeng [1 ,4 ]
Long, Zhongzi [2 ]
Mao, Xiaofeng [3 ]
Lin, Changting [1 ,5 ]
He, Yuan [3 ]
Ji, Shouling [2 ,4 ]
Affiliations
[1] Zhejiang Gongshang Univ, Hangzhou, Peoples R China
[2] Zhejiang Univ, Hangzhou, Peoples R China
[3] Alibaba Grp, Hangzhou, Peoples R China
[4] Alibaba Zhejiang Univ Joint Res Inst Frontier Tec, Hangzhou, Peoples R China
[5] Chinese Acad Sci, Inst Informat Engn, State Key Lab Informat Secur, Beijing, Peoples R China
Keywords
Cross-modal retrieval; Domain adaptation; Cross-dataset training; Adversarial learning; IMAGE;
DOI
10.1016/j.neucom.2021.01.114
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal retrieval is an important but challenging research task in the multimedia community. Most existing works on this task are supervised, typically training models on a large number of aligned image-text/video-text pairs under the assumption that training and testing data are drawn from the same distribution. When this assumption does not hold, traditional cross-modal retrieval methods may suffer a performance drop at evaluation. In this paper, we introduce a new task, domain adaptive cross-modal retrieval, where training (source) data and testing (target) data come from different domains. The task is challenging, as there are not only the semantic gap and modality gap between visual and textual items, but also a domain gap between source and target domains. Therefore, we propose a Multi-level Alignment Network (MAN) with two mapping modules that project the visual and textual modalities into a common space, and three alignments that learn more discriminative features in that space: a semantic alignment to reduce the semantic gap, and a cross-modality alignment and a cross-domain alignment to alleviate the modality gap and the domain gap. Extensive experiments on domain-adaptive image-text retrieval and video-text retrieval demonstrate that our proposed model, MAN, consistently outperforms multiple baselines, showing superior generalization to target data. Moreover, MAN establishes a new state of the art for large-scale text-to-video retrieval on the TRECVID 2017 and 2018 Ad-hoc Video Search benchmarks. (c) 2021 Elsevier B.V. All rights reserved.
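The abstract describes three alignment objectives trained jointly. As an illustration only (the record contains no code, and all function names, toy features, and loss weights below are hypothetical), a common realization of such objectives is: a cross-entropy classification loss for semantic alignment, a margin-based triplet ranking loss for cross-modality alignment, and a domain-discriminator loss (typically trained adversarially via a gradient reversal layer) for cross-domain alignment. A minimal numeric sketch of the combined loss:

```python
import math

def cosine(u, v):
    # cosine similarity between two feature vectors in the common space
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantic_loss(logits, label):
    # semantic alignment: cross-entropy over class logits
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def triplet_loss(anchor, positive, negative, margin=0.2):
    # cross-modality alignment: a matched image-text pair should score
    # higher than a mismatched one by at least `margin`
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

def domain_adv_loss(domain_prob, is_source):
    # cross-domain alignment: binary cross-entropy of a domain
    # discriminator; with a gradient reversal layer, the encoders are
    # pushed to produce domain-invariant features
    p = min(max(domain_prob, 1e-7), 1.0 - 1e-7)
    return -math.log(p) if is_source else -math.log(1.0 - p)

# toy features for one image and two captions (hypothetical values)
img = [0.9, 0.1, 0.0]
txt_pos = [0.8, 0.2, 0.1]
txt_neg = [0.0, 0.9, 0.5]

total = (semantic_loss([2.0, 0.5, -1.0], label=0)
         + triplet_loss(img, txt_pos, txt_neg)
         + 0.1 * domain_adv_loss(0.7, is_source=True))
print(f"total loss: {total:.4f}")
```

The 0.1 weight on the adversarial term is an arbitrary placeholder; in practice such weights are tuned per dataset.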
Pages: 207-219 (13 pages)
Related Papers
Showing 10 of 50
  • [1] Semantic enhancement and multi-level alignment network for cross-modal retrieval
    Chen, Jia
    Zhang, Hong
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2024,
  • [2] Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval
    Nian, Fudong
    Ding, Ling
    Hu, Yuxia
    Gu, Yanhong
    [J]. MATHEMATICS, 2022, 10 (18)
  • [3] Multi-Level Cross-Modal Alignment for Image Clustering
    Qiu, Liping
    Zhang, Qin
    Chen, Xiaojun
    Cai, Shaotian
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 13, 2024, : 14695 - 14703
  • [4] Deep Multi-Level Semantic Hashing for Cross-Modal Retrieval
    Ji, Zhenyan
    Yao, Weina
    Wei, Wei
    Song, Houbing
    Pi, Huaiyu
    [J]. IEEE ACCESS, 2019, 7 : 23667 - 23674
  • [5] Multi-Level Correlation Adversarial Hashing for Cross-Modal Retrieval
    Ma, Xinhong
    Zhang, Tianzhu
    Xu, Changsheng
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (12) : 3101 - 3114
  • [6] A Multi-Level Alignment and Cross-Modal Unified Semantic Graph Refinement Network for Conversational Emotion Recognition
    Zhang, Xiaoheng
    Cui, Weigang
    Hu, Bin
    Li, Yang
    [J]. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2024, 15 (03) : 1553 - 1566
  • [7] Adversarial Modality Alignment Network for Cross-Modal Molecule Retrieval
    Zhao, Wenyu
    Zhou, Dong
    Cao, Buqing
    Zhang, Kai
    Chen, Jinjun
    [J]. IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, 2024, 5 (01): : 278 - 289
  • [8] Adaptive multi-label structure preserving network for cross-modal retrieval
    Zhu, Jie
    Zhang, Hui
    Chen, Junfen
    Xie, Bojun
    Liu, Jianan
    Zhang, Junsan
    [J]. INFORMATION SCIENCES, 2024, 682
  • [9] Disambiguity and Alignment: An Effective Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval
    Zou, Zhuoyang
    Zhu, Xinghui
    Zhu, Qinying
    Zhang, Hongyan
    Zhu, Lei
    [J]. FOODS, 2024, 13 (11)
  • [10] Multi-level adversarial attention cross-modal hashing
    Wang, Benhui
    Zhang, Huaxiang
    Zhu, Lei
    Nie, Liqiang
    Liu, Li
    [J]. SIGNAL PROCESSING-IMAGE COMMUNICATION, 2023, 117