Video Temporal Grounding with Multi-Model Collaborative Learning

Cited by: 0
Authors
Tian, Yun [1 ]
Guo, Xiaobo [1 ]
Wang, Jinsong [1 ]
Li, Bin [2 ]
Zhou, Shoujun [2 ]
Affiliations
[1] Changchun Univ Sci & Technol, Sch Optoelect Engn, Changchun 130022, Peoples R China
[2] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen 518055, Peoples R China
Source
APPLIED SCIENCES-BASEL, 2025, Vol. 15, Issue 06
Keywords
video temporal grounding; collaborative learning; pseudo-label; iterative training; HIERARCHY;
DOI
10.3390/app15063072
CLC Classification Number
O6 [Chemistry];
Discipline Classification Code
0703;
Abstract
Given an untrimmed video and a natural language query, the video temporal grounding task aims to accurately locate the target segment within the video. As a critical bridge between computer vision and natural language processing, this task plays an important role in advancing video understanding. Existing research focuses predominantly on improving the performance of individual models and largely overlooks the potential of multi-model collaboration. Although knowledge-flow methods have been adopted for multi-model and cross-modal collaborative learning, several problems remain, including unidirectional knowledge transfer, low-quality pseudo-label generation, and gradient conflicts during cooperative training. To address these issues, this work proposes a Multi-Model Collaborative Learning (MMCL) framework. By introducing a bidirectional knowledge-transfer paradigm, MMCL enables models to learn collaboratively through the exchange of pseudo-labels. The pseudo-label generation mechanism is further refined with the CLIP model's prior knowledge, improving label accuracy and temporal coherence while filtering out irrelevant temporal segments. The framework also incorporates an iterative training algorithm for multi-model collaboration that alleviates gradient conflicts through alternating optimization, achieving a dynamic balance between collaborative and independent learning. Experiments on multiple benchmark datasets show that MMCL markedly improves the performance of video temporal grounding models, surpassing existing state-of-the-art approaches in mIoU and Rank@1. The framework also supports both homogeneous and heterogeneous model configurations, demonstrating broad versatility and adaptability. This study offers an effective route to multi-model collaborative learning for video temporal grounding, promoting efficient knowledge transfer and opening new directions in video understanding.
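The three mechanisms named in the abstract (bidirectional pseudo-label exchange, CLIP-prior-filtered pseudo-labels, and alternating optimization to avoid gradient conflicts) can be pictured with a toy sketch. The code below is purely illustrative and is not the authors' MMCL implementation: ToyGrounder, pseudo_label, the similarity threshold, and all hyper-parameters are hypothetical stand-ins, and a random vector plays the role of CLIP's query-clip similarities.

```python
# Illustrative sketch only: two toy grounding models exchange pseudo-labels
# each round and are optimized alternately. All names and values are
# hypothetical placeholders, not the paper's released code.
import torch
import torch.nn as nn


class ToyGrounder(nn.Module):
    """Stand-in for a temporal grounding model: maps per-clip features to
    per-clip scores from which a (start, end) segment is read off."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, clip_feats):                # (T, feat_dim) -> (T,)
        return self.scorer(clip_feats).squeeze(-1)


def pseudo_label(scores, clip_text_sim, sim_thresh=0.3):
    """Turn one model's clip scores into a pseudo segment for its partner.
    `clip_text_sim` plays the role of CLIP's query-clip similarity prior and
    suppresses clips the prior deems irrelevant (hypothetical rule)."""
    gated = scores.sigmoid() * (clip_text_sim > sim_thresh).float()
    idx = (gated > 0.5).nonzero().squeeze(-1)
    if idx.numel() == 0:                          # fall back to the best clip
        idx = scores.argmax().unsqueeze(0)
    return idx.min().item(), idx.max().item()     # (start, end) clip indices


def segment_to_targets(start, end, num_clips):
    t = torch.zeros(num_clips)
    t[start:end + 1] = 1.0
    return t


model_a, model_b = ToyGrounder(), ToyGrounder()
opt_a = torch.optim.Adam(model_a.parameters(), lr=1e-3)
opt_b = torch.optim.Adam(model_b.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

clip_feats = torch.randn(32, 128)                 # 32 clips of a toy video
clip_text_sim = torch.rand(32)                    # stand-in for CLIP similarities

for round_idx in range(5):
    # Alternating phases: each model learns from the other's (frozen) pseudo-
    # label in its own update step, so the two losses never share one backward
    # pass -- one simple way to sidestep gradient conflicts.
    with torch.no_grad():
        s, e = pseudo_label(model_b(clip_feats), clip_text_sim)
    loss_a = bce(model_a(clip_feats), segment_to_targets(s, e, 32))
    opt_a.zero_grad()
    loss_a.backward()
    opt_a.step()

    with torch.no_grad():
        s, e = pseudo_label(model_a(clip_feats), clip_text_sim)
    loss_b = bce(model_b(clip_feats), segment_to_targets(s, e, 32))
    opt_b.zero_grad()
    loss_b.backward()
    opt_b.step()
```

In an actual setup the two ToyGrounder instances would be replaced by full grounding models (homogeneous or heterogeneous), and clip_text_sim by real CLIP similarities between the query and video clips; the exchange-and-alternate structure of the loop is the point of the sketch.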
Pages: 27
Related Papers
50 records in total
  • [1] Collaborative Debias Strategy for Temporal Sentence Grounding in Video
    Qi, Zhaobo
    Yuan, Yibo
    Ruan, Xiaowen
    Wang, Shuhui
    Zhang, Weigang
    Huang, Qingming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 10972 - 10986
  • [2] Multi-model deep learning approach for collaborative filtering recommendation system
    Aljunid, Mohammed Fadhel
    Huchaiah, Manjaiah Doddaghatta
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2020, 5 (04) : 268 - 275
  • [3] ProTeGe: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding
    Wang, Lan
    Mittal, Gaurav
    Sajeev, Sandra
    Yu, Ye
    Hall, Matthew
    Boddeti, Vishnu Naresh
    Chen, Mei
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6575 - 6585
  • [4] An embedded multi-model video encoder
    Meng Qinglei
    Jiang Li
    Li Wei
    2006 CHINESE CONTROL CONFERENCE, VOLS 1-5, 2006, : 938 - +
  • [5] Deconfounded Multimodal Learning for Spatio-temporal Video Grounding
    Wang, Jiawei
    Ma, Zhanchang
    Cao, Da
    Le, Yuquan
    Xiao, Junbin
    Chua, Tat-Seng
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 7521 - 7529
  • [6] End-to-end Multi-modal Video Temporal Grounding
    Chen, Yi-Wen
    Tsai, Yi-Hsuan
    Yang, Ming-Hsuan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [7] Learning Sample Importance for Cross-Scenario Video Temporal Grounding
    Bao, Peijun
    Mu, Yadong
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 322 - 329
  • [8] Learning Feature Semantic Matching for Spatio-Temporal Video Grounding
    Zhang, Tong
    Fang, Hao
    Zhang, Hao
    Gao, Jialin
    Lu, Xiankai
    Nie, Xiushan
    Yin, Yilong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 9268 - 9279
  • [9] End-to-end Multi-task Learning Framework for Spatio-Temporal Grounding in Video Corpus
    Gao, Yingqi
    Luo, Zhiling
    Chen, Shiqian
    Zhou, Wei
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022, 2022, : 3958 - 3962
  • [10] Reinforcement learning approach of a multi-model controller
    Al-Akhras, MA
    Proceedings of the IASTED International Conference on Computational Intelligence, 2005, : 244 - 249