Utilizing Text-Video Relationships: A Text-Driven Multi-modal Fusion Framework for Moment Retrieval and Highlight Detection

Authors
Zhou, Siyu [1]
Zhang, Fuwei [2]
Wang, Ruomei [3]
Su, Zhuo [1]
Affiliations
[1] Sun Yat Sen Univ, Natl Engn Res Ctr Digital Life, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] North Univ China, Sch Comp Sci & Technol, Taiyuan, Peoples R China
[3] Sun Yat Sen Univ, Sch Software Engn, Guangzhou, Peoples R China
Keywords
Multimodality; Moment retrieval; Highlight detection
DOI
10.1007/978-981-97-8792-0_18
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video moment retrieval and highlight detection are both text-related tasks in video understanding. Recent works primarily focus on enhancing the interaction between the overall video features and the query text. However, they overlook the relationships between the distinct video modalities and the query text, and they fuse multi-modal video features in a query-agnostic manner. The overall video features obtained this way may lose information relevant to the query text, which makes it difficult to predict accurate results in subsequent reasoning. To address this issue, we introduce a Text-driven Integration Framework (TdIF) that fully leverages the relationships between video modalities and the query text to obtain an enriched video representation. It fuses multi-modal video features under the guidance of the query text, effectively emphasizing query-related video information. In TdIF, we also design a query-adaptive token to enhance the interaction between the video and the query text. Furthermore, to enrich the semantic information of the video representation, we introduce descriptive text of the video and leverage it in a simple and efficient manner. Extensive experiments on the QVHighlights, Charades-STA, TACoS and TVSum datasets validate the superiority of TdIF.
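The contrast the abstract draws between query-agnostic and text-guided fusion can be illustrated with a toy sketch. This is not the TdIF architecture, whose details are not given in this record; it is a minimal illustration under stated assumptions: each clip has one feature vector per modality, the query is a single vector of the same dimension, and the per-clip modality weights are a softmax over scaled dot-product similarities with the query. The function names `query_agnostic_fusion` and `text_guided_fusion` and the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def query_agnostic_fusion(modalities):
    """Average the modality features of each clip, ignoring the query."""
    num_clips = len(modalities[0])
    dim = len(modalities[0][0])
    fused = []
    for t in range(num_clips):
        clip_feats = [m[t] for m in modalities]
        fused.append(
            [sum(f[i] for f in clip_feats) / len(clip_feats) for i in range(dim)]
        )
    return fused


def text_guided_fusion(modalities, query):
    """Weight each modality, per clip, by its similarity to the query.

    Assumption for illustration only: weights are a softmax over scaled
    dot products between the query and each modality's clip feature.
    """
    num_clips = len(modalities[0])
    dim = len(query)
    scale = math.sqrt(dim)
    fused = []
    for t in range(num_clips):
        clip_feats = [m[t] for m in modalities]
        weights = softmax([dot(f, query) / scale for f in clip_feats])
        fused.append(
            [sum(w * f[i] for w, f in zip(weights, clip_feats)) for i in range(dim)]
        )
    return fused


# Toy data: two clips, two modalities, two feature dimensions.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.0, 1.0], [1.0, 0.0]]
query = [4.0, 0.0]  # aligned with the first feature dimension

agnostic = query_agnostic_fusion([visual, audio])
guided = text_guided_fusion([visual, audio], query)
print("query-agnostic:", agnostic)
print("text-guided:  ", guided)
```

In this toy setting the query-agnostic average gives every clip the same fused vector regardless of the query, while the text-guided variant shifts weight toward whichever modality matches the query in each clip. That difference is the kind of query-relevant information the abstract says query-agnostic fusion can lose.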
Pages: 254-268
Page count: 15
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 11727 - 11736