Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval

Cited by: 2
Authors
Nian, Fudong [1 ,2 ]
Ding, Ling [1 ]
Hu, Yuxia [2 ]
Gu, Yanhong [1 ]
Affiliations
[1] Hefei Univ, Sch Adv Mfg Engn, Hefei 230601, Peoples R China
[2] Anhui Jianzhu Univ, Anhui Int Joint Res Ctr Ancient Architecture Inte, Hefei 230601, Peoples R China
Keywords
video-text retrieval; multi-level space learning; cross-modal similarity calculation; IMAGE;
DOI
10.3390/math10183346
Chinese Library Classification
O1 [Mathematics];
Discipline Code
0701; 070101;
Abstract
This paper strives to improve the performance of video-text retrieval. To date, many algorithms have been proposed to facilitate the similarity measurement in video-text retrieval, progressing from a single global semantic to multi-level semantics. However, these methods may suffer from the following limitations: (1) they largely ignore relationship semantics, so the modeled semantic levels are insufficient; (2) constraining the real-valued features of different modalities to lie in the same space solely through feature distance measurement is incomplete; (3) they fail to handle the heavily imbalanced distributions of attribute labels across semantic levels. To overcome these limitations, this paper proposes a novel multi-level cross-modal semantic alignment network (MCSAN) for video-text retrieval that jointly models video-text similarity at the global, entity, action, and relationship semantic levels in a unified deep model. Specifically, both video and text are first decomposed into the global, entity, action, and relationship semantic levels by carefully designed spatial-temporal semantic learning structures. Then, we utilize KLDivLoss and a cross-modal parameter-shared attribute projection layer as statistical constraints to ensure that representations from different modalities at different semantic levels are projected into a common semantic space. In addition, a novel focal binary cross-entropy (FBCE) loss function is presented, which is the first effort to model the imbalanced attribute distribution problem for video-text retrieval. MCSAN effectively exploits the complementary information among the four semantic levels. Extensive experiments on two challenging video-text retrieval datasets, MSR-VTT and VATEX, show the viability of our method.
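The abstract does not spell out the FBCE loss, but the name suggests a focal modulation (in the style of focal loss) applied to per-attribute binary cross-entropy, down-weighting well-classified attribute labels so that rare attributes contribute more to the gradient. The following is a minimal sketch under that assumption; the function name, the `gamma`/`alpha` values, and the averaging scheme are illustrative, not the paper's exact formulation:

```python
import math

def focal_bce(logits, targets, gamma=2.0, alpha=0.25):
    """Sketch of a focal binary cross-entropy over per-attribute logits.

    logits  : list of real-valued attribute scores
    targets : list of 0/1 attribute labels
    gamma   : focusing parameter; higher values suppress easy examples more
    alpha   : weight on the positive class (assumed hyperparameters)
    """
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))        # sigmoid probability
        p_t = p if y == 1 else 1.0 - p        # probability of the true label
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # Standard BCE term, modulated by (1 - p_t)^gamma so confident,
        # correct predictions (typically majority-class labels) are down-weighted
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(logits)
```

With `gamma = 0` and `alpha = 0.5` this reduces (up to a constant factor) to plain binary cross-entropy; increasing `gamma` shifts the loss mass toward hard, under-represented attribute labels, which matches the imbalance problem the abstract describes.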
Pages: 19