Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

Cited by: 4
Authors
Jiang, Chen [1,2]
Liu, Hong [2]
Yu, Xuzheng [2]
Wang, Qing [2]
Cheng, Yuan [1]
Xu, Jia [2]
Liu, Zhongyi [2]
Guo, Qingpei [2]
Chu, Wei [2]
Yang, Ming [2]
Qi, Yuan [1]
Affiliations
[1] Fudan University, Artificial Intelligence Innovation & Incubation Institute, Shanghai, People's Republic of China
[2] Ant Group, Hangzhou, People's Republic of China
Keywords
Text-Video Retrieval; Dual-Modal Attention-Enhanced; Negative-aware InfoNCE; Triplet Partial Margin Contrastive Learning
DOI
10.1145/3581783.3612006
CLC number (Chinese Library Classification)
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
In recent years, the explosion of web videos has made text-video retrieval increasingly essential and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant texts/videos higher than irrelevant ones. The core of this task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on the construction of positive and negative pairs to learn text and video representations. Nevertheless, they do not pay enough attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning with two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module (DMAE) to mine hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we adaptively identify these hard negatives and explicitly highlight their impact in the training loss. Second, our work argues that triplet samples can model fine-grained semantic similarity better than pairwise samples. We therefore present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module that constructs partial-order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL designs an adaptive token masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely used text-video retrieval datasets: MSR-VTT, MSVD, DiDeMo, and ActivityNet.
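The abstract describes the two loss components only in prose. As a rough illustration (not the authors' actual formulation), the following PyTorch sketch shows how a negative-aware InfoNCE term and a triplet partial-margin term could be instantiated. The function names, hyperparameter values (tau, hard_margin, alpha, margin), and the hinge-style penalty on hard negatives are illustrative assumptions; the paper's DMAE hard-negative mining and adaptive token masking are not reproduced here.

import torch
import torch.nn.functional as F

def negative_aware_info_nce(sim, tau=0.05, hard_margin=0.1, alpha=0.5):
    # sim: (B, B) text-video cosine-similarity matrix; the diagonal holds
    # the matched (positive) pairs. All hyperparameters are assumptions.
    labels = torch.arange(sim.size(0), device=sim.device)
    # Symmetric InfoNCE over text-to-video and video-to-text directions.
    base = 0.5 * (F.cross_entropy(sim / tau, labels)
                  + F.cross_entropy(sim.t() / tau, labels))
    pos = sim.diagonal().unsqueeze(1)  # (B, 1) positive score per row
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Treat a negative as "hard" when it scores within hard_margin of
    # its row's positive pair.
    hard = off_diag & (sim > pos - hard_margin)
    # Hinge penalty pushing each hard negative below the positive score.
    penalty = F.relu(sim - pos + hard_margin)[hard]
    extra = penalty.mean() if penalty.numel() > 0 else sim.new_zeros(())
    return base + alpha * extra

def triplet_partial_margin(sim_pos, sim_hard_neg, margin=0.2):
    # Anchor-to-positive similarity should exceed anchor-to-hard-negative
    # similarity (e.g., against a token-masked caption) by at least margin.
    return F.relu(sim_hard_neg - sim_pos + margin).mean()

# Toy usage with random L2-normalized embeddings (batch 8, dim 512).
t = F.normalize(torch.randn(8, 512), dim=-1)
v = F.normalize(torch.randn(8, 512), dim=-1)
loss = negative_aware_info_nce(t @ v.t())

In practice the two terms would be combined with a weighting coefficient, and the triplet's fine-grained negatives would come from the model's own masking strategy rather than random embeddings; the sketch only fixes the general shape of the two objectives.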
Pages: 4626 - 4636
Number of pages: 11
Related papers
12 items in total
  • [1] A cross-modal conditional mechanism based on attention for text-video retrieval
    Du, Wanru
    Jing, Xiaochuan
    Zhu, Quan
    Wang, Xiaoyin
    Liu, Xuan
    [J]. MATHEMATICAL BIOSCIENCES AND ENGINEERING, 2023, 20 (11) : 20073 - 20092
  • [2] A Dual-Modal Attention-Enhanced Deep Learning Network for Quantification of Parkinson's Disease Characteristics
    Xia, Yi
    Yao, ZhiMing
    Ye, Qiang
    Cheng, Nan
    [J]. IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, 2020, 28 (01) : 42 - 51
  • [3] Text-video retrieval method based on enhanced self-attention and multi-task learning
    Wu, Xiaoyu
    Qian, Jiayao
    Wang, Tiantian
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (16) : 24387 - 24406
  • [4] X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
    Gorti, Satya Krishna
    Vouitsis, Noel
    Ma, Junwei
    Golestan, Keyvan
    Volkovs, Maksims
    Garg, Animesh
    Yu, Guangwei
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022 : 4996 - 5005
  • [5] Cross-Modal Learning Based on Semantic Correlation and Multi-Task Learning for Text-Video Retrieval
    Wu, Xiaoyu
    Wang, Tiantian
    Wang, Shengjin
    [J]. ELECTRONICS, 2020, 9 (12) : 1 - 17
  • [6] Multilingual Text-Video Cross-Modal Retrieval Model via Multilingual-Visual Common Space Learning
    Lin, Jun-An
    Bao, Cui-Zhu
    Dong, Jian-Feng
    Yang, Xun
    Wang, Xun
    [J]. JISUANJI XUEBAO/CHINESE JOURNAL OF COMPUTERS, 2024, 47 (09) : 2195 - 2210
  • [7] Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval
    Panta, Love
    Shrestha, Prashant
    Sapkota, Brabeem
    Bhattarai, Amrita
    Manandhar, Suresh
    Sah, Anand Kumar
    [J]. 2024 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS, WACVW 2024, 2024 : 617 - 624
  • [8] Coarse-to-fine dual-level attention for video-text cross modal retrieval
    Jin, Ming
    Zhang, Huaxiang
    Zhu, Lei
    Sun, Jiande
    Liu, Li
    [J]. KNOWLEDGE-BASED SYSTEMS, 2022, 242
  • [9] Dual-enhanced generative model with graph attention network and contrastive learning for aspect sentiment triplet extraction
    Xu, Haowen
    Tang, Mingwei
    Cai, Tao
    Hu, Jie
    Zhao, Mingfeng
    [J]. KNOWLEDGE-BASED SYSTEMS, 2024, 301