CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing

Cited by: 0
Authors
Sardari, Faegheh [1]
Mustafa, Armin [1]
Jackson, Philip J. B. [1]
Hilton, Adrian [1]
Affiliations
[1] Univ Surrey, Ctr Vis Speech & Signal Proc CVSSP, Guildford, Surrey, England
Source
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Unaligned audio-visual learning; Audio-visual video parsing; Weakly supervised learning; Event detection;
DOI
10.1007/978-3-031-73247-8_1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Weakly supervised audio-visual video parsing (AVVP) methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels. Existing approaches tackle this by leveraging unimodal and cross-modal contexts. However, we argue that while cross-modal learning is beneficial for detecting audible-visible events, in the weakly supervised scenario, it negatively impacts unaligned audible or visible events by introducing irrelevant modality information. In this paper, we propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space such that the network explicitly learns to combine cross-modal information for audible-visible events while filtering it out for unaligned events. Additionally, as videos often involve complex class relationships, modelling them improves performance. However, this introduces extra computational costs into the network. Our framework is designed to leverage cross-class relationships during training without incurring additional computations at inference. Furthermore, we propose new metrics to better evaluate a method's capabilities in performing AVVP. Our extensive experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, respectively. Code is available at: https://github.com/faeghehsardari/coleaf.
Pages: 1-17
Number of pages: 17