Multimodal Imbalance-Aware Gradient Modulation for Weakly-Supervised Audio-Visual Video Parsing

Cited by: 0
Authors
Fu, Jie [1]
Gao, Junyu [2, 3]
Bao, Bing-Kun [1]
Xu, Changsheng [2, 3, 4]
Affiliations
[1] Nanjing Univ Posts & Telecommun, Sch Commun & Informat Engn, Nanjing 210009, Peoples R China
[2] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100190, Peoples R China
[4] Peng Cheng Lab, Shenzhen 518055, Peoples R China
Keywords
Imbalance-aware; gradient modulation; weakly-supervised; audio-visual video parsing; attention; network
DOI
10.1109/TCSVT.2023.3337134
Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Subject classification code
0808; 0809
Abstract
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual, and audio-visual event instances and to identify the corresponding event categories, using only video-level category labels for training. Most previous efforts have been devoted to refining the supervision for each modality or to extracting useful cross-modality information for more reliable feature learning. None of them has noticed the imbalanced feature learning between different modalities in this task. In this paper, to balance the feature learning processes of different modalities, a dynamic gradient modulation (DGM) mechanism is explored, in which a novel and effective metric function is designed to measure the imbalance in feature learning between the audio and visual modalities. Furthermore, a closer examination of the principles of traditional WS-AVVP pipelines identifies two additional challenges: confused multimodal computation hampers the precise measurement of audio-visual imbalanced feature learning, and the global supervision provided by video-level labels cannot give explicit guidance for robust semantic feature learning in each action subspace. To cope with these issues, a modality-separated decision unit (MSDU) and a semantic-aware feature extractor (SAFE) are designed for precise measurement of imbalanced feature learning and for unambiguous semantic-aware feature extraction, respectively. Comprehensive experiments on public benchmarks demonstrate the effectiveness of the proposed method.
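The abstract does not give the DGM metric function or its update rule, so the following is only a rough sketch of the general idea of imbalance-aware gradient scaling, not the authors' method. It assumes a hypothetical per-modality confidence score (for example, the mean predicted probability on the ground-truth classes) and a tanh-based damping rule chosen purely for illustration; the function names and the `alpha` parameter are likewise assumptions.

```python
import math


def modulation_coefficients(audio_score, visual_score, alpha=1.0):
    """Return (audio_coeff, visual_coeff) gradient scaling factors.

    The scores are hypothetical positive per-modality confidences. The
    modality with the higher score is treated as dominant and its gradient
    is damped; the weaker modality keeps a coefficient of 1.0.
    This rule is an illustrative assumption, not the paper's metric.
    """
    if audio_score <= 0 or visual_score <= 0:
        raise ValueError("scores must be positive")
    ratio = audio_score / visual_score
    if ratio > 1.0:  # audio is learning faster: damp audio gradients
        return 1.0 - math.tanh(alpha * (ratio - 1.0)), 1.0
    if ratio < 1.0:  # visual is learning faster: damp visual gradients
        return 1.0, 1.0 - math.tanh(alpha * (1.0 / ratio - 1.0))
    return 1.0, 1.0  # balanced: leave both gradients unchanged


def modulate(grads, coeff):
    """Scale a flat list of gradient values by a modulation coefficient."""
    return [g * coeff for g in grads]


if __name__ == "__main__":
    a_coeff, v_coeff = modulation_coefficients(audio_score=0.8, visual_score=0.4)
    print(modulate([0.5, -1.0], a_coeff), modulate([0.5, -1.0], v_coeff))
```

In a real training loop the coefficients would be recomputed every iteration and applied to the gradients of each modality's encoder before the optimizer step; how the scores are obtained per modality is exactly the part the paper attributes to its metric function and MSDU.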
Pages: 4843-4856
Number of pages: 14
Related papers
44 in total
  • [1] Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing
    Wu, Yu
    Yang, Yi
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 1326-1335
  • [2] Boosting Positive Segments for Weakly-Supervised Audio-Visual Video Parsing
    Rachavarapu, Kranthi Kumar
    Rajagopalan, A. N.
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023: 10158-10168
  • [3] Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
    Fan, Yingying
    Wu, Yu
    Du, Bo
    Lin, Yutian
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
  • [5] Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing
    Mo, Shentong
    Tian, Yapeng
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [6] DHHN: Dual Hierarchical Hybrid Network for Weakly-Supervised Audio-Visual Video Parsing
    Jiang, Xun
    Xu, Xing
    Chen, Zhiguo
    Zhang, Jingran
    Song, Jingkuan
    Shen, Fumin
    Lu, Huimin
    Shen, Heng Tao
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022
  • [7] Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing
    Cheng, Haoyue
    Liu, Zhaoyang
    Zhou, Hang
    Qian, Chen
    Wu, Wayne
    Wang, Limin
    [J]. COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694: 431-448
  • [8] Multi-Level Signal Fusion for Enhanced Weakly-Supervised Audio-Visual Video Parsing
    Sun, Xin
    Wang, Xuan
    Liu, Qiong
    Zhou, Xi
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2024, 31: 1149-1153
  • [9] Weakly-Supervised Audio-Visual Segmentation
    Mo, Shentong
    Raj, Bhiksha
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
  • [10] Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing
    Lin, Yan-Bo
    Tseng, Hung-Yu
    Lee, Hsin-Ying
    Lin, Yen-Yu
    Yang, Ming-Hsuan
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34