Utilizing Text-Video Relationships: A Text-Driven Multi-modal Fusion Framework for Moment Retrieval and Highlight Detection

Cited by: 0
Authors
Zhou, Siyu [1 ]
Zhang, Fuwei [2 ]
Wang, Ruomei [3 ]
Su, Zhuo [1 ]
Affiliations
[1] Sun Yat Sen Univ, Natl Engn Res Ctr Digital Life, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] North Univ China, Sch Comp Sci & Technol, Taiyuan, Peoples R China
[3] Sun Yat Sen Univ, Sch Software Engn, Guangzhou, Peoples R China
Keywords
Multimodality; Moment retrieval; Highlight detection
DOI
10.1007/978-981-97-8792-0_18
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Video moment retrieval and highlight detection are both text-related tasks in video understanding. Recent works primarily focus on enhancing the interaction between overall video features and the query text. However, they overlook the relationships between distinct video modalities and the query text, and fuse multi-modal video features in a query-agnostic manner. The overall video features obtained through this fusion may lose information relevant to the query text, making it difficult to predict results accurately in subsequent reasoning. To address this issue, we introduce a Text-driven Integration Framework (TdIF) that fully leverages the relationships between video modalities and the query text to obtain an enriched video representation. It fuses multi-modal video features under the guidance of the query text, effectively emphasizing query-related video information. In TdIF, we also design a query-adaptive token to strengthen the interaction between the video and the query text. Furthermore, to enrich the semantic information of the video representation, we introduce and leverage descriptive text of the video in a simple and efficient manner. Extensive experiments on the QVHighlights, Charades-STA, TACoS, and TVSum datasets validate the superiority of TdIF.
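The core idea the abstract describes, fusing features from several video modalities with weights conditioned on the query text rather than in a query-agnostic way, can be illustrated with a minimal sketch. This is not the paper's actual TdIF architecture (which the abstract does not specify in detail); the function name, the two-modality setup, and the dot-product gating are illustrative assumptions.

```python
import numpy as np

def text_guided_fusion(rgb, motion, text):
    """Hypothetical sketch of query-driven multi-modal fusion.

    rgb, motion : (T, D) arrays of per-clip features from two video modalities.
    text        : (D,) pooled query-text embedding.
    Returns a (T, D) fused representation where, for each clip, the modality
    more relevant to the query receives a larger fusion weight.
    """
    mods = np.stack([rgb, motion], axis=1)            # (T, 2, D)
    # Scaled dot-product relevance of each modality to the query text.
    scores = mods @ text / np.sqrt(text.shape[0])     # (T, 2)
    # Softmax over the modality axis -> text-driven fusion weights.
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                 # (T, 2), rows sum to 1
    return (w[..., None] * mods).sum(axis=1)          # (T, D)
```

In a query-agnostic scheme the weights would be fixed (e.g. a plain average), so clips whose relevant evidence lives in only one modality would be diluted; conditioning the weights on the text is what lets the fused representation emphasize query-related information.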
Pages: 254-268 (15 pages)