Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization

Cited by: 0
Authors
Bao, Peijun [1 ]
Yang, Wenhan [1 ,2 ]
Boon Poh Ng [1 ]
Er, Meng Hwa [1 ]
Kot, Alex C. [1 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Peng Cheng Lab, Shenzhen, Peoples R China
Keywords
DOI
Not available
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper is the first to explore audio-visual event localization in an unsupervised manner. Previous methods tackle this problem in a supervised setting and require segment-level or video-level event category ground truth to train the model. However, building large-scale multi-modality datasets with category annotations is labor-intensive and thus does not scale to real-world applications. To this end, we propose cross-modal label contrastive learning, which exploits the multi-modal information shared between unlabeled audio and visual streams as a self-supervision signal. At the feature representation level, multi-modal representations are learned collaboratively from the audio and visual components via self-supervised representation learning. At the label level, we propose a novel self-supervised pretext task, i.e., label contrasting, to self-annotate videos with pseudo-labels for training the localization model. Because irrelevant background segments hinder the acquisition of high-quality pseudo-labels and thus lead to an inferior localization model, we further propose an expectation-maximization algorithm that optimizes pseudo-label acquisition and the localization model in a coarse-to-fine manner. Extensive experiments demonstrate that our unsupervised approach performs reasonably well compared with state-of-the-art supervised methods.
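To make the abstract's two main ingredients concrete, the following is a minimal PyTorch sketch, not the authors' released code: a symmetric InfoNCE-style cross-modal contrastive objective between paired audio and visual embeddings, and an EM-style alternation between pseudo-label assignment (with low-confidence, likely-background segments suppressed) and localization-model training. The `model(audio, visual)` interface, the 0.5 confidence threshold, and the fixed-order data loader are hypothetical placeholders introduced here for illustration only.

```python
# Illustrative sketch only; interfaces and thresholds are assumptions,
# not the method as published.
import torch
import torch.nn.functional as F


def cross_modal_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE between paired audio/visual embeddings.

    audio_emb, visual_emb: (batch, dim) tensors from the same videos;
    matched rows are positives, all other pairings act as negatives.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                    # (batch, batch)
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def em_pseudo_label_training(model, loader, optimizer, num_rounds=3,
                             conf_threshold=0.5):
    """Coarse-to-fine EM-style alternation (illustrative only).

    E-step: score segments with the frozen model and keep only confident
            predictions as pseudo-labels, suppressing likely background.
    M-step: retrain the localization model on the retained pseudo-labels.
    Assumes `loader` iterates in a fixed (non-shuffled) order and `model`
    maps (audio, visual) batches to per-segment logits of shape (B, T, C).
    """
    for _ in range(num_rounds):
        # E-step: assign pseudo-labels with the current model.
        pseudo = []
        model.eval()
        with torch.no_grad():
            for audio, visual in loader:
                probs = model(audio, visual).softmax(dim=-1)
                conf, labels = probs.max(dim=-1)
                labels[conf < conf_threshold] = -100  # ignored in the loss
                pseudo.append(labels)
        # M-step: update the model on the retained pseudo-labels.
        model.train()
        for (audio, visual), labels in zip(loader, pseudo):
            logits = model(audio, visual)
            loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten(),
                                   ignore_index=-100)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Each EM round tightens the loop: the refined model yields cleaner pseudo-labels in the next E-step, which is one plausible reading of the coarse-to-fine behavior the abstract describes.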
Pages: 215-222
Number of pages: 8