Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization

Cited by: 0

Authors:

Bao, Peijun [1]

Yang, Wenhan [1,2]

Boon Poh Ng [1]

Er, Meng Hwa [1]

Kot, Alex C. [1]

Affiliations:

[1] Nanyang Technol Univ, Singapore, Singapore

[2] Peng Cheng Lab, Shenzhen, Peoples R China

Source:

THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1 | 2023

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper explores, for the first time, audio-visual event localization in an unsupervised manner. Previous methods tackle this problem in a supervised setting and require segment-level or video-level event category ground truth to train the model. However, building large-scale multi-modality datasets with category annotations is labor-intensive and thus not scalable to real-world applications. To this end, we propose cross-modal label contrastive learning to exploit multi-modal information among unlabeled audio and visual streams as self-supervision signals. At the feature representation level, multi-modal representations are collaboratively learned from audio and visual components by using self-supervised representation learning. At the label level, we propose a novel self-supervised pretext task, i.e., label contrasting, to self-annotate videos with pseudo-labels for localization model training. Note that irrelevant background can hinder the acquisition of high-quality pseudo-labels and thus lead to an inferior localization model. To address this issue, we then propose an expectation-maximization algorithm that optimizes the pseudo-label acquisition and the localization model in a coarse-to-fine manner. Extensive experiments demonstrate that our unsupervised approach performs reasonably well compared to state-of-the-art supervised methods.
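
The abstract does not spell out the loss used at the representation level. As a rough illustration of the general idea of cross-modal contrastive learning only, and not the authors' formulation, the sketch below computes a symmetric InfoNCE-style loss in which the audio and visual features of the same segment form a positive pair and all other pairings in the batch act as negatives. The function names, the toy 2-D features, and the temperature value are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(audio, visual, temperature=0.1):
    """Symmetric InfoNCE loss; audio[i] and visual[i] are the positive pair.

    Hypothetical helper for illustration, not taken from the paper.
    """
    n = len(audio)
    # Temperature-scaled similarity matrix: rows index audio, columns visual.
    sim = [[cosine(a, v) / temperature for v in visual] for a in audio]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the positive logit.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_z - logits[target]

    audio_to_visual = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    visual_to_audio = sum(
        cross_entropy([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (audio_to_visual + visual_to_audio)


# Toy features (assumed): matching pairs versus deliberately swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

On these toy inputs the loss is close to zero when each audio feature matches its own visual feature and much larger when the pairs are swapped, which is the behavior a contrastive objective of this kind is meant to reward.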

Citation

Pages: 215-222

Page count: 8

50 items in total

[1] Temporal Cross-Modal Attention for Audio-Visual Event Localization
Nagasaki Y.
Hayashi M.
Kaneko N.
Aoki Y.
[J]. Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering, 2022, 88 (03): 263-268
[2] Cross-modal Background Suppression for Audio-Visual Event Localization
Xia, Yan
Zhao, Zhou
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 19957-19966
[3] SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding
Sun, Chao
Chen, Min
Cheng, Jialiang
Liang, Han
Zhu, Chuanbo
Chen, Jincai
[J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 261-270
[4] Audio-Visual Event Localization based on Cross-Modal Interacting Guidance
Yue, Qiurui
Wu, Xiaoyu
Gao, Jiayi
[J]. 2021 IEEE FOURTH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND KNOWLEDGE ENGINEERING (AIKE 2021), 2021: 104-107
[5] Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization
Xu, Haoming
Zeng, Runhao
Wu, Qingyao
Tan, Mingkui
Gan, Chuang
[J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020: 3893-3901
[6] Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization
Xuan, Hanyu
Zhang, Zhenyu
Chen, Shuo
Yang, Jian
Yan, Yan
[J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34: 279-286
[7] Cross-Modal learning for Audio-Visual Video Parsing
Lamba, Jatin
Abhishek
Akula, Jayaprakash
Dabral, Rishabh
Jyothi, Preethi
Ramakrishnan, Ganesh
[J]. INTERSPEECH 2021, 2021: 1937-1941
[8] Effect of Uncertainty in Audio-Visual Cross-Modal Statistical Learning
Nagy, Marton
Reguly, Helga
Markus, Benjamin
Fiser, Jozsef
[J]. PERCEPTION, 2019, 48: 109
[9] Deep Cross-Modal Audio-Visual Generation
Chen, Lele
Srivastava, Sudhanshu
Duan, Zhiyao
Xu, Chenliang
[J]. PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17), 2017: 349-357
[10] Cross-modal prediction in audio-visual communication
Rao, RR
Chen, TH
[J]. 1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996: 2056-2059
