CSS-Net: A Consistent Segment Selection Network for Audio-Visual Event Localization

Cited by: 2
Authors
Feng, Fan [1]
Ming, Yue [1]
Hu, Nannan [1]
Yu, Hui [2]
Liu, Yuanan [1]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Elect Engn, Beijing Key Lab Work Safety Intelligent Monitoring, Beijing 100876, Peoples R China
[2] Univ Portsmouth, Sch Creat Technol, Portsmouth PO1 2DJ, England
Funding
Beijing Natural Science Foundation
Keywords
Visualization; Semantics; Location awareness; Videos; Feature extraction; Correlation; Task analysis; Attention mechanism; audio-visual event localization; multi-modal learning; ATTENTION NETWORK;
DOI
10.1109/TMM.2023.3270624
CLC number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Audio-visual event (AVE) localization aims to localize the temporal boundaries of events that contain both visual and audio content and to identify event categories in unconstrained videos. Existing work usually utilizes successive video segments for temporal modeling. However, ambient sounds or irrelevant visual targets in some segments often cause audio-visual semantic inconsistency, resulting in inaccurate global event modeling. To tackle this issue, we present a consistent segment selection network (CSS-Net) in this paper. First, we propose a novel bidirectional guided co-attention (BGCA) block, containing two distinct attention paths, from audio to vision and from vision to audio, to focus on sound-related visual regions and event-related sound segments. Then, we propose a novel context-aware similarity measure (CASM) module to select semantically consistent visual and audio segments. A cross-correlation matrix is constructed from the correlation coefficients between the visual and audio feature pairs at all time steps. By retaining highly correlated segments and discarding weakly correlated ones, the visual and audio features can learn global event semantics in videos. Finally, we propose a novel audio-visual contrastive loss that learns similar semantic representations for the visual and audio global features under cosine and L2 similarity constraints. Extensive experiments on the public AVE dataset demonstrate the effectiveness of the proposed CSS-Net. It achieves the best localization accuracies of 80.5% and 76.8% in the fully and weakly supervised settings, respectively, compared with other state-of-the-art methods.
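For a concrete picture of the three components named in the abstract, the PyTorch-style sketch below illustrates one plausible reading of the BGCA co-attention paths, the CASM segment selection over a cross-correlation matrix, and the cosine/L2 contrastive loss. It is a minimal sketch based only on the abstract: the function names, the residual form of the attention, the segment-level (rather than spatial-region) treatment of the visual stream, the keep_ratio threshold, and the margin value are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def bgca_block(v, a):
    """Bidirectional guided co-attention (BGCA) sketch.

    v, a: (T, D) visual and audio segment features of one video.
    Two attention paths: audio-guided attention over visual segments and
    vision-guided attention over audio segments, each added back as a residual.
    """
    d = v.size(-1)
    # audio -> vision path: audio queries attend over visual segments
    v_att = F.softmax(a @ v.t() / d ** 0.5, dim=-1) @ v       # (T, D)
    # vision -> audio path: visual queries attend over audio segments
    a_att = F.softmax(v @ a.t() / d ** 0.5, dim=-1) @ a       # (T, D)
    return v + v_att, a + a_att


def casm_select(v, a, keep_ratio=0.8):
    """Context-aware similarity measure (CASM) sketch.

    Builds a cross-correlation matrix from cosine similarities of all
    visual/audio segment pairs, scores each time step by its matched
    correlation, and keeps only the most correlated segments.
    keep_ratio is an assumed threshold, not the paper's setting.
    """
    v_n = F.normalize(v, dim=-1)
    a_n = F.normalize(a, dim=-1)
    corr = v_n @ a_n.t()                       # (T, T) cross-correlation matrix
    score = corr.diagonal()                    # per-segment audio-visual correlation
    k = max(1, int(keep_ratio * v.size(0)))
    keep = torch.zeros(v.size(0), dtype=torch.bool)
    keep[score.topk(k).indices] = True         # retain highly correlated segments
    return v[keep], a[keep], keep


def av_contrastive_loss(v_global, a_global, margin=0.2):
    """Audio-visual contrastive loss sketch with cosine and L2 terms.

    v_global, a_global: (B, D) global visual/audio features of a batch.
    Matched pairs are pulled together (cosine and L2); mismatched pairs in
    the batch are pushed apart with a hinge margin. The weighting and the
    margin value are assumptions.
    """
    v_n = F.normalize(v_global, dim=-1)
    a_n = F.normalize(a_global, dim=-1)
    sim = v_n @ a_n.t()                        # (B, B) cosine similarities
    pos = sim.diagonal()                       # matched audio-visual pairs
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool)
    hinge = F.relu(margin + sim - pos.unsqueeze(1))[off_diag].mean()
    l2 = F.mse_loss(v_n, a_n)                  # L2 constraint on matched pairs
    return hinge + l2


if __name__ == "__main__":
    # Run the sketch on random features to show the expected shapes.
    T, D, B = 10, 128, 4
    v, a = torch.randn(T, D), torch.randn(T, D)
    v, a = bgca_block(v, a)
    v_sel, a_sel, mask = casm_select(v, a)
    loss = av_contrastive_loss(torch.randn(B, D), torch.randn(B, D))
    print(mask.sum().item(), loss.item())
```

The toy main block only checks that the pieces compose; in the actual model the selected segments would feed the global event classifier, and the contrastive loss would be added to the localization objective.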
Pages: 701 - 713
Number of pages: 13
Related papers (50 records in total)
  • [1] Dual Perspective Network for Audio-Visual Event Localization
    Rao, Varshanth
    Khalil, Md Ibrahim
    Li, Haoda
    Dai, Peng
    Lu, Juwei
    COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694 : 689 - 704
  • [2] Dynamic interactive learning network for audio-visual event localization
    Chen, Jincai
    Liang, Han
    Wang, Ruili
    Zeng, Jiangfeng
    Lu, Ping
    APPLIED INTELLIGENCE, 2023, 53 (24) : 30431 - 30442
  • [3] Dense Modality Interaction Network for Audio-Visual Event Localization
    Liu, Shuo
    Quan, Weize
    Wang, Chaoqun
    Liu, Yuan
    Liu, Bin
    Yan, Dong-Ming
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2734 - 2748
  • [4] Multi-Relation Learning Network for audio-visual event localization
    Zhang, Pufen
    Wang, Jiaxiang
    Wan, Meng
    Chang, Sijie
    Ding, Lianhong
    Shi, Peng
    KNOWLEDGE-BASED SYSTEMS, 2025, 310
  • [5] Audio-Visual Event Localization in Unconstrained Videos
    Tian, Yapeng
    Shi, Jing
    Li, Bochen
    Duan, Zhiyao
    Xu, Chenliang
    COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 252 - 268
  • [6] Bi-Directional Modality Fusion Network for Audio-Visual Event Localization
    Liu, Shuo
    Quan, Weize
    Liu, Yuan
    Yan, Dong-Ming
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4868 - 4872
  • [7] Learning Event-Specific Localization Preferences for Audio-Visual Event Localization
    Ge, Shiping
    Jiang, Zhiwei
    Yin, Yafeng
    Wang, Cong
    Cheng, Zifeng
    Gu, Qing
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3446 - 3454
  • [8] Semantic and Relation Modulation for Audio-Visual Event Localization
    Wang, Hao
    Zha, Zheng-Jun
    Li, Liang
    Chen, Xuejin
    Luo, Jiebo
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 7711 - 7725
  • [9] Dual Attention Matching for Audio-Visual Event Localization
    Wu, Yu
    Zhu, Linchao
    Yan, Yan
    Yang, Yi
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 6301 - 6309