Utilizing Text-Video Relationships: A Text-Driven Multi-modal Fusion Framework for Moment Retrieval and Highlight Detection

Cited by: 0

Authors:

Zhou, Siyu [1]

Zhang, Fuwei [2]

Wang, Ruomei [3]

Su, Zhuo [1]

Affiliations:

[1] Sun Yat Sen Univ, Natl Engn Res Ctr Digital Life, Sch Comp Sci & Engn, Guangzhou, Peoples R China

[2] North Univ China, Sch Comp Sci & Technol, Taiyuan, Peoples R China

[3] Sun Yat Sen Univ, Sch Software Engn, Guangzhou, Peoples R China

Source:

PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT X | 2025 / Vol. 15040

Keywords:

Multimodality; Moment retrieval; Highlight detection;

DOI:

10.1007/978-981-97-8792-0_18

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video moment retrieval and highlight detection are both text-related tasks in video understanding. Recent works primarily focus on enhancing the interaction between overall video features and the query text. However, they overlook the relationships between distinct video modalities and the query text, and they fuse multi-modal video features in a query-agnostic manner. The overall video features obtained through this fusion may lose information relevant to the query text, making it difficult to predict results accurately in subsequent reasoning. To address this issue, we introduce a Text-driven Integration Framework (TdIF) that fully leverages the relationships between video modalities and the query text to obtain an enriched video representation. It fuses multi-modal video features under the guidance of the query text, effectively emphasizing query-related video information. In TdIF, we also design a query-adaptive token to enhance the interaction between the video and the query text. Furthermore, to enrich the semantic information of the video representation, we introduce and leverage descriptive text of the video in a simple and efficient manner. Extensive experiments on the QVHighlights, Charades-STA, TACoS, and TVSum datasets validate the superiority of TdIF.


Pages: 254-268

Page count: 15

