Cross-modal video retrieval algorithm based on multi-semantic clues

Cited: 0
Authors
Ding L. [1]
Li Y. [1]
Yu C. [2]
Liu Y. [1]
Wang X. [1,3]
Qi S. [1,3]
Affiliations
[1] School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen
[2] School of Digital Media, Shenzhen Institute of Information Technology, Shenzhen
[3] Peng Cheng Laboratory, Shenzhen
Funding
National Natural Science Foundation of China
Keywords
Cross-modal video retrieval; Distance measurement loss function; Multi-head attention mechanism; Multi-modal; Multi-semantic clues
DOI
10.13700/j.bh.1001-5965.2020.0470
Abstract
Most existing cross-modal video retrieval algorithms map heterogeneous data into a common space so that semantically similar data lie close together and semantically dissimilar data lie far apart, i.e., they establish only the global similarity relationship between data of different modalities. These methods, however, ignore the rich semantic clues within the data, which limits the quality of the generated features. To address this problem, a cross-modal retrieval model based on multi-semantic clues is proposed. The model captures the semantically important frames within the video modality through a multi-head self-attention mechanism, attending to the key information in the video data to obtain global features. A bidirectional Gated Recurrent Unit (GRU) is used to capture contextual interaction features within the multi-modal data. The method also mines local information in the video and text data by jointly encoding the subtle differences between local data. Together, the global features, contextual interaction features, and local features form the multi-semantic clues of the multi-modal data, which better exploit the semantic information in the data and improve retrieval performance. In addition, an improved triplet distance measurement loss function is proposed, which adopts a hard negative mining strategy based on similarity sorting and improves the learning of cross-modal features. Experiments show that, compared with state-of-the-art methods, the proposed method improves the text-to-video retrieval task by 11.1% on the MSR-VTT dataset and by 5.0% on the MSVD dataset. © 2021, Editorial Board of JBUAA. All rights reserved.
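The abstract describes the model only at a high level, so the following PyTorch sketch is an illustrative reconstruction rather than the authors' code. The class and function names (MultiSemanticVideoEncoder, triplet_loss_hard_negative), the layer sizes (2048-d frame features, 512 hidden units, 8 attention heads), the mean/max pooling choices, the convolution kernel sizes, and the VSE++-style hardest-negative selection after similarity sorting are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiSemanticVideoEncoder(nn.Module):
    """Illustrative encoder combining three clues: a global feature from
    multi-head self-attention, a contextual feature from a bidirectional GRU,
    and a local feature from 1-D convolutions over the GRU outputs."""

    def __init__(self, feat_dim=2048, hidden=512, heads=8, out_dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.convs = nn.ModuleList(
            [nn.Conv1d(2 * hidden, hidden, k, padding=k // 2) for k in (2, 3, 4)]
        )
        self.fc = nn.Linear(feat_dim + 2 * hidden + 3 * hidden, out_dim)

    def forward(self, frames):                      # frames: (B, T, feat_dim)
        # Global clue: self-attention weighs the semantically important frames,
        # then the attended sequence is mean-pooled over time.
        g, _ = self.attn(frames, frames, frames)
        g = g.mean(dim=1)
        # Contextual clue: the bidirectional GRU models interactions between
        # neighbouring frames in both directions.
        h, _ = self.gru(frames)                     # (B, T, 2 * hidden)
        c = h.mean(dim=1)
        # Local clue: small 1-D convolutions over the GRU outputs pick up
        # fine-grained local differences, max-pooled over time.
        l = torch.cat(
            [conv(h.transpose(1, 2)).max(dim=2).values for conv in self.convs],
            dim=1,
        )
        return F.normalize(self.fc(torch.cat([g, c, l], dim=1)), dim=1)


def triplet_loss_hard_negative(video_emb, text_emb, margin=0.2):
    """Bidirectional triplet ranking loss over L2-normalised embeddings: for
    each positive pair the negatives are sorted by cosine similarity and the
    hardest one drives the loss (VSE++-style assumption)."""
    sim = video_emb @ text_emb.t()                  # (B, B) similarity matrix
    pos = sim.diag()                                # matched-pair similarities

    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float("-inf"))      # exclude the positives

    hardest_t2v = neg.sort(dim=0, descending=True).values[0]     # per text
    hardest_v2t = neg.sort(dim=1, descending=True).values[:, 0]  # per video

    loss_t2v = F.relu(margin + hardest_t2v - pos).mean()
    loss_v2t = F.relu(margin + hardest_v2t - pos).mean()
    return loss_t2v + loss_v2t
```

In a training step under these assumptions, frame-level CNN features of shape (B, T, 2048) would pass through the encoder, a corresponding text encoder would produce sentence embeddings in the same 1024-d space, and the two batches of embeddings would be fed to the loss.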
Pages: 596-604
Page count: 8