Utilizing Text-Video Relationships: A Text-Driven Multi-modal Fusion Framework for Moment Retrieval and Highlight Detection

Cited by: 0

Authors:

Zhou, Siyu [1]

Zhang, Fuwei [2]

Wang, Ruomei [3]

Su, Zhuo [1]

Affiliations:

[1] Sun Yat Sen Univ, Natl Engn Res Ctr Digital Life, Sch Comp Sci & Engn, Guangzhou, Peoples R China

[2] North Univ China, Sch Comp Sci & Technol, Taiyuan, Peoples R China

[3] Sun Yat Sen Univ, Sch Software Engn, Guangzhou, Peoples R China

Source:

PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT X | 2025 / Vol. 15040

Keywords:

Multimodality; Moment retrieval; Highlight detection;

DOI:

10.1007/978-981-97-8792-0_18

Chinese Library Classification (CLC):

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video moment retrieval and highlight detection are both text-related tasks in video understanding. Recent works primarily focus on enhancing the interaction between overall video features and the query text. However, they overlook the relationships between distinct video modalities and the query text, and they fuse multi-modal video features in a query-agnostic manner. The overall video features obtained through this fusion may lose information relevant to the query text, making it difficult to predict results accurately in subsequent reasoning. To address this issue, we introduce a Text-driven Integration Framework (TdIF) that fully leverages the relationships between video modalities and the query text to obtain an enriched video representation. It fuses multi-modal video features under the guidance of the query text, effectively emphasizing query-related video information. In TdIF, we also design a query-adaptive token to enhance the interaction between the video and the query text. Furthermore, to enrich the semantic information of the video representation, we introduce and leverage descriptive text of the video in a simple and efficient manner. Extensive experiments on the QVHighlights, Charades-STA, TACoS, and TVSum datasets validate the superiority of TdIF.

Pages: 254-268

Page count: 15
