Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval

Cited by: 5
Authors
Wu, Xiaoyu [1 ]
Wang, Tiantian [1 ]
Wang, Shengjin [2 ]
Affiliations
[1] Commun Univ China, Sch Informat & Commun Engn, Beijing 100024, Peoples R China
[2] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
cross-modal learning; text-video retrieval; semantic correlation; multi-task learning;
DOI
10.3390/electronics9122125
Chinese Library Classification (CLC)
TP [Automation and Computer Technology];
Discipline Code
0812;
Abstract
Text-video retrieval faces the major challenge of the semantic gap between cross-modal information. Some existing methods map text and video into a common subspace to measure their similarity; however, they impose no semantic consistency constraint when associating the semantic encodings of the two modalities, so the learned association is weak. In this paper, we propose a cross-modal retrieval algorithm based on semantic correlation and multi-task learning. First, multi-level features of the video and text are extracted with multiple deep networks so that the information in both modalities is fully encoded. Then, in the common feature space into which both modalities are mapped, we build a multi-task learning framework that combines semantic similarity measurement with semantic consistency classification over text-video features. The semantic consistency classification task constrains the learning of the semantic association task, so multi-task learning guides better feature mapping for the two modalities and optimizes the construction of the unified feature subspace. Finally, our algorithm outperforms existing methods on the Microsoft Video Description (MSVD) and MSR-Video to Text (MSR-VTT) datasets, which shows that it improves cross-modal retrieval performance.
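The multi-task objective the abstract describes, a similarity (ranking) term over text-video pairs in the common space plus a semantic consistency classification term, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the margin, the weighting factor `lam`, and all function names are assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between row vectors of a and b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def triplet_ranking_loss(sim, margin=0.2):
    """Bidirectional hinge ranking loss on a (B, B) similarity matrix
    whose diagonal holds the matched text-video pairs."""
    pos = np.diag(sim)
    cost_t2v = np.maximum(0.0, margin + sim - pos[:, None])  # text -> negative videos
    cost_v2t = np.maximum(0.0, margin + sim - pos[None, :])  # video -> negative texts
    np.fill_diagonal(cost_t2v, 0.0)
    np.fill_diagonal(cost_v2t, 0.0)
    return cost_t2v.mean() + cost_v2t.mean()

def consistency_loss(logits, labels):
    """Softmax cross-entropy predicting a semantic class from an embedding."""
    z = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

def multitask_loss(text_emb, video_emb, text_logits, video_logits, labels, lam=0.5):
    """Joint objective: ranking loss in the common space, constrained by a
    semantic consistency classification loss on both modalities."""
    l_rank = triplet_ranking_loss(cosine_sim(text_emb, video_emb))
    l_cls = consistency_loss(text_logits, labels) + consistency_loss(video_logits, labels)
    return l_rank + lam * l_cls
```

In this sketch the classification head shares the embeddings with the retrieval task, which is how the consistency task can restrain (regularize) the semantic association task during joint training.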
Pages: 1-17