Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing

Authors
Lin, Yan-Bo [1, 2]
Tseng, Hung-Yu [3 ]
Lee, Hsin-Ying [4 ]
Lin, Yen-Yu [1 ]
Yang, Ming-Hsuan [5, 6]
Affiliations
[1] Natl Yang Ming Chiao Tung Univ, Hsinchu, Taiwan
[2] Univ N Carolina, Chapel Hill, NC 27515 USA
[3] UC Merced, Merced, CA USA
[4] Snap Res, Merced, CA USA
[5] Google Res, Mountain View, CA USA
[6] Yonsei Univ, Seoul, South Korea
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The audio-visual video parsing task aims to temporally parse a video into audio or visual event categories. However, temporally annotating audio and visual events is labor-intensive, which hampers the learning of a parsing model. To this end, we propose to explore additional cross-video and cross-modality supervisory signals to facilitate weakly-supervised audio-visual video parsing. The proposed method exploits both the common and diverse event semantics across videos to identify audio or visual events. In addition, our method explores event co-occurrence across audio, visual, and audio-visual streams. We leverage the explored cross-modality co-occurrence to localize segments of target events while excluding irrelevant ones. The discovered supervisory signals across different videos and modalities can greatly facilitate training with only video-level annotations. Quantitative and qualitative results demonstrate that the proposed method performs favorably against existing methods on weakly-supervised audio-visual video parsing.
Pages: 13
Related papers
50 in total
  • [1] Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing
    Wu, Yu
    Yang, Yi
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 1326 - 1335
  • [2] Boosting Positive Segments for Weakly-Supervised Audio-Visual Video Parsing
    Rachavarapu, Kranthi Kumar
    Rajagopalan, A. N.
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 10158 - 10168
  • [3] Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
    Fan, Yingying
    Wu, Yu
    Du, Bo
    Lin, Yutian
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [4] Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
    School of Computer Science, Hubei Luojia Laboratory, Wuhan University, China
    [J]. Adv. neural inf. proces. syst., 1600,
  • [5] Exploring Denoised Cross-video Contrast for Weakly-supervised Temporal Action Localization
    Li, Jingjing
    Yang, Tianyu
    Ji, Wei
    Wang, Jue
    Cheng, Li
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19882 - 19892
  • [6] Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing
    Mo, Shentong
    Tian, Yapeng
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [7] DHHN: Dual Hierarchical Hybrid Network for Weakly-Supervised Audio-Visual Video Parsing
    Jiang, Xun
    Xu, Xing
    Chen, Zhiguo
    Zhang, Jingran
    Song, Jingkuan
    Shen, Fumin
    Lu, Huimin
    Shen, Heng Tao
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022,
  • [8] Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing
    Cheng, Haoyue
    Liu, Zhaoyang
    Zhou, Hang
    Qian, Chen
    Wu, Wayne
    Wang, Limin
    [J]. COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694 : 431 - 448
  • [9] Multimodal Imbalance-Aware Gradient Modulation for Weakly-Supervised Audio-Visual Video Parsing
    Fu, Jie
    Gao, Junyu
    Bao, Bing-Kun
    Xu, Changsheng
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (06) : 4843 - 4856
  • [10] Multi-Level Signal Fusion for Enhanced Weakly-Supervised Audio-Visual Video Parsing
    Sun, Xin
    Wang, Xuan
    Liu, Qiong
    Zhou, Xi
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 1149 - 1153