CSS-Net: A Consistent Segment Selection Network for Audio-Visual Event Localization

Cited by: 2
|
Authors
Feng, Fan [1 ]
Ming, Yue [1 ]
Hu, Nannan [1 ]
Yu, Hui [2 ]
Liu, Yuanan [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Elect Engn, Beijing Key Lab Work Safety Intelligent Monitorin, Beijing 100876, Peoples R China
[2] Univ Portsmouth, Sch Creat Technol, Portsmouth PO1 2DJ, England
Funding
Beijing Natural Science Foundation;
Keywords
Visualization; Semantics; Location awareness; Videos; Feature extraction; Correlation; Task analysis; Attention mechanism; audio-visual event localization; multi-modal learning; ATTENTION NETWORK;
DOI
10.1109/TMM.2023.3270624
Chinese Library Classification
TP [Automation technology; computer technology];
Discipline code
0812;
Abstract
Audio-visual event (AVE) localization aims to localize the temporal boundaries of events that contain both visual and audio content, and to identify event categories in unconstrained videos. Existing work usually uses successive video segments for temporal modeling. However, ambient sounds or irrelevant visual targets in some segments often cause audio-visual semantic inconsistency, resulting in inaccurate global event modeling. To tackle this issue, we present a consistent segment selection network (CSS-Net) in this paper. First, we propose a novel bidirectional guided co-attention (BGCA) block, containing two distinct attention paths, from audio to vision and from vision to audio, to focus on sound-related visual regions and event-related sound segments. Then, we propose a novel context-aware similarity measure (CASM) module to select semantically consistent visual and audio segments. A cross-correlation matrix is constructed from the correlation coefficients between the visual and audio feature pairs at all time steps. By retaining highly correlated segments and discarding weakly correlated ones, the visual and audio features can learn global event semantics in videos. Finally, we propose a novel audio-visual contrastive loss that learns similar semantic representations for the visual and audio global features under cosine and L2 similarity constraints. Extensive experiments on the public AVE dataset demonstrate the effectiveness of the proposed CSS-Net. Its localization accuracies of 80.5% and 76.8% in the fully- and weakly-supervised settings, respectively, are the best among the compared state-of-the-art methods.
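The segment-selection idea sketched in the abstract can be illustrated with a toy example. The snippet below is a minimal sketch, not the paper's implementation: it assumes per-segment visual and audio feature matrices, builds a cross-correlation matrix from normalized feature pairs, and keeps the time steps with the highest same-time audio-visual correlation. The function name, `keep_ratio` parameter, and scoring rule are all hypothetical choices for illustration.

```python
import numpy as np

def select_consistent_segments(visual, audio, keep_ratio=0.75):
    """Toy CASM-style selection: keep segments whose visual and audio
    features agree, drop weakly correlated ones.

    visual, audio: (T, D) arrays of per-segment features.
    Returns sorted indices of the retained time steps.
    """
    # L2-normalize each segment feature so dot products behave like
    # correlation coefficients in [-1, 1].
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + 1e-8)
    a = audio / (np.linalg.norm(audio, axis=1, keepdims=True) + 1e-8)
    # Cross-correlation matrix over all time-step pairs (T x T).
    corr = v @ a.T
    # Score each time step by its same-time audio-visual correlation
    # (the diagonal of the cross-correlation matrix).
    scores = np.diag(corr)
    # Keep the top keep_ratio fraction of segments, in temporal order.
    k = max(1, int(round(keep_ratio * len(scores))))
    return np.sort(np.argsort(scores)[-k:])
```

With four segments where the last one has mismatched audio (e.g. ambient noise), the mismatched segment gets the lowest score and is discarded, while the three consistent segments are retained.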
Pages: 701-713
Page count: 13
Related papers
50 records total
  • [41] Self-Supervised Video Representation and Temporally Adaptive Attention for Audio-Visual Event Localization
    Ran, Yue
    Tang, Hongying
    Li, Baoqing
    Wang, Guohui
    APPLIED SCIENCES-BASEL, 2022, 12 (24):
  • [42] Segment-level event perception with semantic dictionary for weakly supervised audio-visual video parsing
    Xie, Zhuyang
    Yang, Yan
    Yu, Yankai
    Wang, Jie
    Liu, Yan
    Jiang, Yongquan
    KNOWLEDGE-BASED SYSTEMS, 2025, 310
  • [43] AUDIO-VISUAL SPEAKER LOCALIZATION VIA WEIGHTED CLUSTERING
    Gebru, Israel D.
    Alameda-Pineda, Xavier
    Horaud, Radu
    Forbes, Florence
    2014 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2014,
  • [44] Tracking atoms with particles for audio-visual source localization
    Monaci, Gianluca
    Vandergheynst, Pierre
    Maggio, Emilio
    Cavallaro, Andrea
    2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL II, PTS 1-3, 2007, : 753 - +
  • [45] Audio-visual speaker localization using graphical models
    Kushal, Akash
    Rahurkar, Mandar
    Li Fei-Fei
    Ponce, Jean
    Huang, Thomas
    18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2006, : 291 - +
  • [46] Audio-Visual Localization by Synthetic Acoustic Image Generation
    Sanguineti, Valentina
    Morerio, Pietro
    Del Bue, Alessio
    Murino, Vittorio
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2523 - 2531
  • [47] Distributed audio-visual archives network (DiVAN)
    Tirakis, A
    Katalagarianos, P
    Papathomas, M
    Hamilakis, C
    IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA COMPUTING AND SYSTEMS, PROCEEDINGS VOL 2, 1999, : 1086 - 1088
  • [48] Integrated audio-visual processing for object localization and tracking
    Pingali, GS
    MULTIMEDIA COMPUTING AND NETWORKING 1998, 1997, 3310 : 206 - 213
  • [49] Omnidirectional audio-visual talker localization based on dynamic fusion of audio-visual features using validity and reliability criteria
    Denda, Yuki
    Nishiura, Takanobu
    Yamashita, Yoichi
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2008, E91D (03): : 598 - 606
  • [50] Audio-Visual Event Classification via Spatial-Temporal-Audio Words
    Cao, Yu
    Baang, Sung
    Liu, Shih-Hsi 'Alex'
    Li, Ming
    Hu, Sanqing
    19TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOLS 1-6, 2008, : 858 - +