Based on Spatial and Temporal Implicit Semantic Relational Inference for Cross-Modal Retrieval

Cited: 0
Authors
Jin, M. [1]
Hu, W. [1]
Zhu, L. [2]
Wang, X. [3]
Hong, R. [1]
Affiliations
[1] School of Computer and Information, Hefei University of Technology, Hefei
[2] School of Electronic and Information Engineering, Tongji University, Shanghai
[3] School of Data Science, University of Science and Technology of China, Hefei
Keywords
Computational modeling; Cross-modal retrieval; Data models; Feature extraction; Semantic alignment; Semantic mining; Semantics; Task analysis; Temporal space inference; Training; Visualization
DOI
10.1109/TCSVT.2024.3411298
Abstract
To meet users’ demands for video retrieval, text-video cross-modal retrieval technology continues to evolve. Methods based on pre-trained models and transfer learning are widely used to design cross-modal retrieval models and have significantly improved retrieval accuracy. However, these methods fall short in modeling the relationships between video frames, which prevents the model from fully establishing the hidden semantic relationships within video features. To further infer the implicit semantic relationships among video frames, we propose a cross-modal retrieval model based on graph convolutional networks (GCN) and visual semantic inference (GVSI). The GCN establishes relationships between video-frame features, facilitating the mining of hidden semantic information across frames. To let textual semantic features help the model infer temporal and implicit semantic information between video frames, we introduce a semantic mining and temporal space (SM&TS) inference module. Additionally, we design semantic alignment modules (SA_M) to align the explicit and implicit object features present in both video and text. Finally, we analyze and validate the effectiveness of the model on the MSR-VTT, MSVD, and LSMDC datasets.
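The abstract outlines an architecture rather than an implementation, so a minimal sketch may help fix the core idea: treat the frames of one video as nodes of a similarity-weighted graph, run a GCN-style update so each frame aggregates semantic context from the others, then align the pooled video representation with the sentence embedding through a contrastive objective. Everything below is an illustrative assumption, not the paper's code; the class name FrameGCN, the softmax adjacency, and the InfoNCE-style loss are stand-ins for the GVSI, SM&TS, and SA_M components described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameGCN(nn.Module):
    """Illustrative GCN-style layer over video-frame features (an assumption,
    not the paper's implementation). Each video is a graph whose nodes are
    frame features; adjacency comes from pairwise frame similarity, so
    message passing can surface implicit relations between distant frames."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) per-frame features from a pre-trained encoder
        sim = torch.matmul(frames, frames.transpose(1, 2))     # (B, T, T) similarities
        adj = F.softmax(sim / frames.size(-1) ** 0.5, dim=-1)  # row-normalized adjacency
        msg = torch.matmul(adj, frames)                        # aggregate over neighbors
        return F.relu(self.proj(msg)) + frames                 # residual update

# Toy retrieval objective: align pooled video features with sentence
# embeddings via a symmetric InfoNCE loss (a common stand-in for the
# alignment role that SA_M plays in the paper).
videos = torch.randn(8, 12, 512)        # 8 videos, 12 frames, 512-d features
texts = torch.randn(8, 512)             # 8 matching sentence embeddings
v = FrameGCN(512)(videos).mean(dim=1)   # (8, 512) pooled video representations
logits = F.normalize(v, dim=-1) @ F.normalize(texts, dim=-1).T / 0.07
targets = torch.arange(8)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

Building the adjacency from pairwise frame similarity, rather than from fixed temporal links, is what lets message passing surface implicit relations between non-adjacent frames; per the abstract, the SM&TS module additionally conditions this inference on text semantics.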
Pages: 1-1
Related Papers
50 records in total
  • [1] Qiao, Nan; Mao, Junyi; Xie, Hao; Wang, Zhiguo; Yin, Guangqiang. Cross-Modal Retrieval Based on Semantic Filtering and Adaptive Pooling. Proceedings of the 13th International Conference on Computer Engineering and Networks, Vol II (CENET 2023), 2024, 1126: 296-310.
  • [2] Wang, Cheng; Yang, Haojin; Meinel, Christoph. Deep Semantic Mapping for Cross-Modal Retrieval. 2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI 2015), 2015: 234-241.
  • [3] Xie, Liang; Pan, Peng; Lu, Yansheng. Analyzing semantic correlation for cross-modal retrieval. Multimedia Systems, 2015, 21 (06): 525-539.
  • [4] Yao, Tao; Kong, Xiangwei; Fu, Haiyan; Tian, Qi. Semantic consistency hashing for cross-modal retrieval. Neurocomputing, 2016, 193: 250-259.
  • [5] Wu, Yiling; Wang, Shuhui; Huang, Qingming. Multi-modal semantic autoencoder for cross-modal retrieval. Neurocomputing, 2019, 331: 165-175.
  • [6] Weng, Weiwei; Wu, Jiagao; Yang, Lu; Liu, Linfeng; Hu, Bin. Label-Based Deep Semantic Hashing for Cross-Modal Retrieval. Neural Information Processing (ICONIP 2019), Pt III, 2019, 11955: 24-36.
  • [7] Zhu, Lei; Song, Jiayu; Wei, Xiangxiang; Yu, Hao; Long, Jun. CAESAR: concept augmentation based semantic representation for cross-modal retrieval. Multimedia Tools and Applications, 2022, 81 (24): 34213-34243.
  • [8] Jiang, Bin; Huang, Xin; Yang, Chao; Yuan, Junsong. Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention. ICMR'19: Proceedings of the 2019 ACM International Conference on Multimedia Retrieval, 2019: 217-225.