Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization

Cited by: 0

Authors:

Bao, Peijun [1]

Yang, Wenhan [1,2]

Boon Poh Ng [1]

Er, Meng Hwa [1]

Kot, Alex C. [1]

Affiliations:

[1] Nanyang Technological University, Singapore

[2] Peng Cheng Laboratory, Shenzhen, People's Republic of China

Source:

Thirty-Seventh AAAI Conference on Artificial Intelligence, Vol. 37, No. 1 | 2023

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper is the first to explore audio-visual event localization in an unsupervised manner. Previous methods tackle this problem in a supervised setting and require segment-level or video-level event category ground truth to train the model. However, building large-scale multi-modal datasets with category annotations is labor-intensive and thus does not scale to real-world applications. To this end, we propose cross-modal label contrastive learning, which exploits multi-modal information in unlabeled audio and visual streams as self-supervision signals. At the feature representation level, multi-modal representations are learned collaboratively from the audio and visual components through self-supervised representation learning. At the label level, we propose a novel self-supervised pretext task, i.e., label contrasting, to self-annotate videos with pseudo-labels for training the localization model. Note that irrelevant background would hinder the acquisition of high-quality pseudo-labels and thus lead to an inferior localization model. To address this issue, we further propose an expectation-maximization algorithm that optimizes the pseudo-label acquisition and the localization model in a coarse-to-fine manner. Extensive experiments demonstrate that our unsupervised approach performs reasonably well compared to state-of-the-art supervised methods.

Pages: 215-222

Page count: 8
