Multilevel Semantic Interaction Alignment for Video-Text Cross-Modal Retrieval

Cited by: 1
Authors
Chen, Lei [1]
Deng, Zhen [2]
Liu, Libo [1]
Yin, Shibai [2]
Affiliations
[1] Ningxia Univ, Coll Informat Engn, Yinchuan 750021, Peoples R China
[2] Ningxia Univ, Coll Informat Engn, Yinchuan 611130, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Weak semantic data; video-text retrieval; cross-modal retrieval; cross-alignment; attention mechanism;
DOI
10.1109/TCSVT.2024.3360530
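Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification code
0808; 0809;
Abstract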
Video-text cross-modal retrieval (VTR) is more natural and challenging than image-text retrieval and has attracted increasing interest from researchers in recent years. To align VTR more closely with real-world scenarios, in which a weak semantic text description serves as the query, we propose a multilevel semantic interaction alignment (MSIA) model. We develop a two-stream network that decomposes video and text alignment into multiple dimensions. Specifically, in the video stream, to better align heterogeneous data, redundant video information is suppressed via a frame adaptation attention mechanism, and richer semantic interaction is achieved through a text-guided attention mechanism. Then, to align text with local regions of the video, we design an anchor frame strategy and a word selection method. Finally, a cross-granularity alignment approach is designed to learn more and finer-grained semantic features. Together, these components reinforce the alignment between video and weak semantic text descriptions, alleviating the alignment difficulty that such descriptions cause. Experimental results on VTR benchmark datasets show that our approach performs competitively with state-of-the-art methods. The code is available at: https://github.com/jiaranjintianchism/MSIA.
Pages: 6559-6575
Number of pages: 17
Related papers
50 in total
  • [31] Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval
    Hao, Xiaoshuai
    Zhang, Wanqian
    Wu, Dayan
    Zhu, Fei
    Li, Bo
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18962 - 18972
  • [32] Multilevel Deep Semantic Feature Asymmetric Network for Cross-Modal Hashing Retrieval
    Jiang, Xiaolong
    Fan, Jiabao
    Zhang, Jie
    Lin, Ziyong
    Li, Mingyong
    [J]. IEEE LATIN AMERICA TRANSACTIONS, 2024, 22 (08) : 621 - 631
  • [33] Unified Coarse-to-Fine Alignment for Video-Text Retrieval
    Wang, Ziyang
    Sung, Yi-Lin
    Cheng, Feng
    Bertasius, Gedas
    Bansal, Mohit
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2804 - 2815
  • [34] Dual Encoding Integrating Key Frame Extraction for Video-text Cross-modal Entity Resolution
    Zeng, Zhixian
    Cao, Jianjun
    Weng, Nianfeng
    Jiang, Guoquan
    Fan, Qiang
    [J]. Binggong Xuebao/Acta Armamentarii, 2022, 43 (05) : 1107 - 1116
  • [35] SEMANTIC-PRESERVING METRIC LEARNING FOR VIDEO-TEXT RETRIEVAL
    Choo, Sungkwon
    Ha, Seong Jong
    Lee, Joonsoo
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 2388 - 2392
  • [36] Learning Shared Semantic Space with Correlation Alignment for Cross-Modal Event Retrieval
    Yang, Zhenguo
    Lin, Zehang
    Kang, Peipei
    Lv, Jianming
    Li, Qing
    Liu, Wenyin
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2020, 16 (01)
  • [37] Semantic enhancement and multi-level alignment network for cross-modal retrieval
    Chen, Jia
    Zhang, Hong
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2024,
  • [38] Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval
    Ge, Xuri
    Chen, Fuhai
    Xu, Songpei
    Tao, Fuxiang
    Jose, Joemon M.
    [J]. 2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 1022 - 1031
  • [39] SAM: cross-modal semantic alignments module for image-text retrieval
    Park, Pilseo
    Jang, Soojin
    Cho, Yunsung
    Kim, Youngbin
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (04) : 12363 - 12377
  • [40] Learning Hierarchical Semantic Correspondences for Cross-Modal Image-Text Retrieval
    Zeng, Sheng
    Liu, Changhong
    Zhou, Jun
    Chen, Yong
    Jiang, Aiwen
    Li, Hanxi
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 239 - 248