Multilevel Semantic Interaction Alignment for Video-Text Cross-Modal Retrieval

被引:1
|
作者
Chen, Lei [1 ]
Deng, Zhen [2 ]
Liu, Libo [1 ]
Yin, Shibai [2 ]
机构
[1] Ningxia Univ, Coll Informat Engn, Yinchuan 750021, Peoples R China
[2] Ningxia Univ, Coll Informat Engn, Yinchuan 611130, Peoples R China
基金
中国国家自然科学基金;
关键词
Weak semantic data; video-text retrieval; cross-modal retrieval; cross-alignment; attention mechanism;
D O I
10.1109/TCSVT.2024.3360530
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
Video-text cross-modal retrieval (VTR) is more natural and challenging than image-text retrieval, which has attracted increasing interest from researchers in recent years. To align VTR more closely with real-world scenarios, i.e., weak semantic text description as a query, we propose a multilevel semantic interaction alignment (MSIA) model. We develop a two-stream network, which decomposes video and text alignment into multiple dimensions. Specifically, in the video stream, to better align heterogeneity data, redundant video information is suppressed via the designed frame adaptation attention mechanism, and richer semantic interaction is achieved through a text-guided attention mechanism. Then, for text alignment in the video local region, we design a distinctive anchor frame strategy and a word selection method. Finally, a cross-granularity alignment approach is designed to learn more and finer semantic features. With the above schema, the alignment between video and weak semantic text descriptions is reinforced, further alleviating the issues of difficult alignment caused by weak semantic text descriptions. The experimental results on VTR benchmark datasets show the competitive performance of our approach in comparison to that of state-of-the-art methods. The code is available at: https://github.com/jiaranjintianchism/MSIA.
引用
收藏
页码:6559 / 6575
页数:17
相关论文
共 50 条
  • [1] Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval
    Nian, Fudong
    Ding, Ling
    Hu, Yuxia
    Gu, Yanhong
    [J]. MATHEMATICS, 2022, 10 (18)
  • [2] Cross-Modal Video Retrieval Model Based on Video-Text Dual Alignment
    Che, Zhanbin
    Guo, Huaili
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (02) : 303 - 311
  • [3] Survey on Video-Text Cross-Modal Retrieval
    Chen, Lei
    Xi, Yimeng
    Liu, Libo
    [J]. Computer Engineering and Applications, 2024, 60 (04) : 1 - 20
  • [4] Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval
    Jin, Weike
    Zhao, Zhou
    Zhang, Pengcheng
    Zhu, Jieming
    He, Xiuqiang
    Zhuang, Yueting
    [J]. SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1114 - 1124
  • [5] CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval
    Gao, Yizhao
    Lu, Zhiwu
    [J]. PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, : 76 - 84
  • [6] Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
    Mithun, Niluthpol Chowdhury
    Li, Juncheng
    Metze, Florian
    Roy-Chowdhury, Amit K.
    [J]. ICMR '18: PROCEEDINGS OF THE 2018 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2018, : 19 - 27
  • [7] Fine-Grained Cross-Modal Contrast Learning for Video-Text Retrieval
    Liu, Hui
    Lv, Gang
    Gu, Yanhong
    Nian, Fudong
    [J]. ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT V, ICIC 2024, 2024, 14866 : 298 - 310
  • [8] Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval
    Hao, Xiaoshuai
    Zhou, Yucan
    Wu, Dayan
    Zhang, Wanqian
    Li, Bo
    Wang, Weiping
    [J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 135 - 143
  • [9] Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
    Jin, Peng
    Huang, Jinfa
    Xiong, Pengfei
    Tian, Shangxuan
    Liu, Chang
    Ji, Xiangyang
    Yuan, Li
    Chen, Jie
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2472 - 2482
  • [10] CLIP4Hashing: Unsupervised Deep Hashing for Cross-Modal Video-Text Retrieval
    Zhuo, Yaoxin
    Li, Yikang
    Hsiao, Jenhao
    Ho, Chiuman
    Li, Baoxin
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 158 - 166