CSS-Net: A Consistent Segment Selection Network for Audio-Visual Event Localization

Cited by: 2
Authors
Feng, Fan [1]
Ming, Yue [1]
Hu, Nannan [1]
Yu, Hui [2]
Liu, Yuanan [1]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Elect Engn, Beijing Key Lab Work Safety Intelligent Monitoring, Beijing 100876, Peoples R China
[2] Univ Portsmouth, Sch Creat Technol, Portsmouth PO1 2DJ, England
Funding
Beijing Natural Science Foundation;
Keywords
Visualization; Semantics; Location awareness; Videos; Feature extraction; Correlation; Task analysis; Attention mechanism; audio-visual event localization; multi-modal learning; ATTENTION NETWORK;
DOI
10.1109/TMM.2023.3270624
Chinese Library Classification
TP [Automation technology; computer technology];
Discipline code
0812;
Abstract
Audio-visual event (AVE) localization aims to localize the temporal boundaries of events that contain both visual and audio content, and to identify event categories in unconstrained videos. Existing work usually relies on successive video segments for temporal modeling. However, ambient sounds or irrelevant visual targets in some segments often cause audio-visual semantic inconsistency, resulting in inaccurate global event modeling. To tackle this issue, we present a consistent segment selection network (CSS-Net). First, we propose a novel bidirectional guided co-attention (BGCA) block, containing two distinct attention paths, from audio to vision and from vision to audio, to focus on sound-related visual regions and event-related sound segments. Second, we propose a novel context-aware similarity measure (CASM) module to select semantically consistent visual and audio segments. A cross-correlation matrix is constructed from the correlation coefficients between the visual and audio feature pairs at all time steps. By retaining highly correlated segments and discarding weakly correlated ones, the visual and audio features can learn global event semantics in videos. Finally, we propose a novel audio-visual contrastive loss that learns similar semantic representations for the visual and audio global features under cosine and L2 similarity constraints. Extensive experiments on the public AVE dataset demonstrate the effectiveness of the proposed CSS-Net: it achieves the best localization accuracies of 80.5% and 76.8% in the fully and weakly supervised settings, respectively, compared with other state-of-the-art methods.
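The two ideas the abstract sketches most concretely are the CASM-style segment selection (correlate paired visual and audio features per time step, keep the consistent segments) and the contrastive objective combining cosine and L2 similarities. The following is a minimal NumPy sketch of those two ideas only, not the authors' implementation: the function names, the top-k keep ratio, and the unweighted sum in the loss are illustrative assumptions.

```python
import numpy as np

def select_consistent_segments(vis, aud, keep_ratio=0.75):
    """Toy CASM-style selection (assumed interface, not the paper's code).

    vis, aud: (T, D) arrays of per-segment visual and audio features.
    Computes a correlation coefficient per time step and keeps the
    top-k most consistent segments.
    """
    T = vis.shape[0]
    # Pearson correlation between the paired features at each time step
    corr = np.array([np.corrcoef(vis[t], aud[t])[0, 1] for t in range(T)])
    k = max(1, int(round(keep_ratio * T)))
    # indices of the k most correlated segments, in temporal order
    keep = np.sort(np.argsort(corr)[-k:])
    return keep, corr

def av_contrastive_loss(g_v, g_a):
    """Toy audio-visual loss: pull global features together under
    both a cosine and an L2 similarity constraint (equal weights
    are an assumption)."""
    cos = np.dot(g_v, g_a) / (np.linalg.norm(g_v) * np.linalg.norm(g_a))
    l2 = np.linalg.norm(g_v - g_a)
    return (1.0 - cos) + l2
```

With synthetic features where one segment's audio is anti-correlated with its visual counterpart, that segment is the one discarded, and the loss vanishes for identical global features, which is the qualitative behavior the abstract describes.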
Pages: 701-713 (13 pages)