CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

Cited by: 0
Authors
Sardari, Faegheh [1]
Mustafa, Armin [1]
Jackson, Philip J. B. [1]
Hilton, Adrian [1]
Affiliations
[1] University of Surrey, Centre for Vision, Speech and Signal Processing (CVSSP), Guildford, Surrey, England
Source
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Unaligned audio-visual learning; Audio-visual video parsing; Weakly supervised learning; Event detection;
DOI
10.1007/978-3-031-73247-8_1
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario it negatively impacts unaligned audible or visible events by introducing irrelevant modality information. In this paper, we propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space such that the network explicitly learns to combine cross-modal information for audible-visible events while filtering it out for unaligned events. Additionally, as videos often involve complex class relationships, modelling them improves performance. However, this introduces extra computational costs into the network. Our framework is designed to leverage cross-class relationships during training without incurring additional computations at inference. Furthermore, we propose new metrics to better evaluate a method's capabilities in performing AVVP. Our extensive experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, respectively. Code is available at: https://github.com/faeghehsardari/coleaf.
Pages: 1-17
Page count: 17