CSS-Net: A Consistent Segment Selection Network for Audio-Visual Event Localization

Cited by: 2

Authors:

Feng, Fan [1]

Ming, Yue [1]

Hu, Nannan [1]

Yu, Hui [2]

Liu, Yuanan [1]

Affiliations:

[1] Beijing Univ Posts & Telecommun, Sch Elect Engn, Beijing Key Lab Work Safety Intelligent Monitorin, Beijing 100876, Peoples R China

[2] Univ Portsmouth, Sch Creat Technol, Portsmouth PO1 2DJ, England

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2024 / Vol. 26

Funding:

Beijing Natural Science Foundation;

Keywords:

Visualization; Semantics; Location awareness; Videos; Feature extraction; Correlation; Task analysis; Attention mechanism; audio-visual event localization; multi-modal learning; ATTENTION NETWORK;

DOI:

10.1109/TMM.2023.3270624

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Audio-visual event (AVE) localization aims to localize the temporal boundaries of events that contain both visual and audio content and to identify event categories in unconstrained videos. Existing work usually uses successive video segments for temporal modeling. However, ambient sounds or irrelevant visual targets in some segments often cause audio-visual semantic inconsistency, resulting in inaccurate global event modeling. To tackle this issue, we present a consistent segment selection network (CSS-Net) in this paper. First, we propose a novel bidirectional guided co-attention (BGCA) block, containing two distinct attention paths from audio to vision and from vision to audio, to focus on sound-related visual regions and event-related sound segments. Then, we propose a novel context-aware similarity measure (CASM) module to select semantically consistent visual and audio segments. A cross-correlation matrix is constructed using the correlation coefficients between the visual and audio feature pairs in all time steps. By extracting highly correlated segments and discarding weakly correlated segments, visual and audio features can learn global event semantics in videos. Finally, we propose a novel audio-visual contrastive loss to learn similar semantic representations for visual and audio global features under the constraints of cosine and L2 similarities. Extensive experiments on the public AVE dataset demonstrate the effectiveness of our proposed CSS-Net. The localization accuracy reaches 80.5% and 76.8% in the fully- and weakly-supervised settings, respectively, outperforming other state-of-the-art methods.
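The correlation-based segment selection described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' CASM module: the function names, the threshold value, the choice to score each segment by its same-time-step (diagonal) correlation, and the simple cosine-plus-L2 penalty (which has no negative pairs and is therefore not the paper's contrastive loss) are all assumptions made for illustration.

```python
import numpy as np


def cross_correlation_matrix(visual, audio, eps=1e-8):
    """Pearson correlation between every visual/audio segment pair.

    visual, audio: arrays of shape (T, D), one feature vector per segment.
    Returns a (T, T) matrix C with C[i, j] = corr(visual[i], audio[j]).
    """
    # Center each segment's feature vector, then scale it to unit length,
    # so the dot product of two rows equals their correlation coefficient.
    v = visual - visual.mean(axis=1, keepdims=True)
    a = audio - audio.mean(axis=1, keepdims=True)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    return v @ a.T


def select_consistent_segments(visual, audio, threshold=0.5):
    """Keep segments whose same-time-step correlation meets a threshold.

    Assumption: the diagonal of the matrix is used as the consistency
    score; the paper may combine the entries differently.
    """
    corr = cross_correlation_matrix(visual, audio)
    keep = np.diag(corr) >= threshold
    return keep, corr


def similarity_penalty(v_global, a_global, eps=1e-8):
    """Hypothetical penalty: (1 - cosine similarity) + L2 distance."""
    cos = v_global @ a_global / (
        np.linalg.norm(v_global) * np.linalg.norm(a_global) + eps
    )
    return (1.0 - cos) + np.linalg.norm(v_global - a_global)


# Toy data: 4 segments with 8-dimensional features. Segment 2 is made
# inconsistent by flipping the sign of its visual feature.
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(4, 8))
visual_feats = audio_feats.copy()
visual_feats[2] = -audio_feats[2]

keep_mask, corr_matrix = select_consistent_segments(visual_feats, audio_feats)

# Pool only the kept segments into global features and compare them.
v_pooled = visual_feats[keep_mask].mean(axis=0)
a_pooled = audio_feats[keep_mask].mean(axis=0)
penalty = similarity_penalty(v_pooled, a_pooled)
```

In this toy setup the sign-flipped segment has a same-time-step correlation of about -1 and is discarded, while the remaining segments are pooled into global features whose penalty is close to zero.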


Pages: 701-713

Page count: 13

50 items in total

[31] Deep Audio-Visual Beamforming for Speaker Localization
Qian, Xinyuan
Zhang, Qiquan
Guan, Guohui
Xue, Wei
IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 1132 - 1136
[32] Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization
Xu, Haoming
Zeng, Runhao
Wu, Qingyao
Tan, Mingkui
Gan, Chuang
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3893 - 3901
[33] Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization
Bao, Peijun
Yang, Wenhan
Boon Poh Ng
Er, Meng Hwa
Kot, Alex C.
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 1, 2023, : 215 - 222
[34] Span-based Audio-Visual Localization
Wu, Yiling
Zhang, Xinfeng
Wang, Yaowei
Huang, Qingming
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 1252 - 1260
[35] Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
Duan, Bin
Tang, Hao
Wang, Wei
Zong, Ziliang
Yang, Guowei
Yan, Yan
2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021, 2021, : 4012 - 4021
[36] Audio-Visual Salieny Network with Audio Attention Module
Cheng, Shuaiyang
Gao, Xing
Song, Liang
Xiahou, Jianbing
PROCEEDINGS OF 2021 2ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND INFORMATION SYSTEMS (ICAIIS '21), 2021,
[37] Acoustic and Visual Knowledge Distillation for Contrastive Audio-Visual Localization
Yaghoubi, Ehsan
Kelm, Andre
Gerkmann, Timo
Frintrop, Simone
PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023, 2023, : 15 - 23
[38] AUDIO-VISUAL EVENT RECOGNITION THROUGH THE LENS OF ADVERSARY
Li, Juncheng B.
Ma, Kaixin
Qu, Shuhui
Huang, Po-Yao
Metze, Florian
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 616 - 620
[39] Audio-visual event recognition in surveillance video sequences
Cristani, Marco
Bicego, Manuele
Murino, Vittorio
IEEE TRANSACTIONS ON MULTIMEDIA, 2007, 9 (02) : 257 - 267
[40] DMMAN: A two-stage audio-visual fusion framework for sound separation and event localization
Hu, Ruihan
Zhou, Songbing
Tang, Zhi Ri
Chang, Sheng
Huang, Qijun
Liu, Yisen
Han, Wei
Wu, Edmond Q.
NEURAL NETWORKS, 2021, 133 : 229 - 239
