CSS-Net: A Consistent Segment Selection Network for Audio-Visual Event Localization

Cited by: 2
|
Authors
Feng, Fan [1 ]
Ming, Yue [1 ]
Hu, Nannan [1 ]
Yu, Hui [2 ]
Liu, Yuanan [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Elect Engn, Beijing Key Lab Work Safety Intelligent Monitorin, Beijing 100876, Peoples R China
[2] Univ Portsmouth, Sch Creat Technol, Portsmouth PO1 2DJ, England
Funding
Beijing Natural Science Foundation;
Keywords
Visualization; Semantics; Location awareness; Videos; Feature extraction; Correlation; Task analysis; Attention mechanism; audio-visual event localization; multi-modal learning; ATTENTION NETWORK;
DOI
10.1109/TMM.2023.3270624
Chinese Library Classification
TP [Automation technology; computer technology];
Discipline code
0812;
Abstract
Audio-visual event (AVE) localization aims to localize the temporal boundaries of events that contain both visual and audio content, and to identify event categories in unconstrained videos. Existing work usually uses successive video segments for temporal modeling. However, ambient sounds or irrelevant visual targets in some segments often cause audio-visual semantic inconsistency, resulting in inaccurate global event modeling. To tackle this issue, we present a consistent segment selection network (CSS-Net) in this paper. First, we propose a novel bidirectional guided co-attention (BGCA) block, containing two distinct attention paths, from audio to vision and from vision to audio, to focus on sound-related visual regions and event-related sound segments. Then, we propose a novel context-aware similarity measure (CASM) module to select semantically consistent visual and audio segments. A cross-correlation matrix is constructed from the correlation coefficients between the visual and audio feature pairs at all time steps. By retaining highly correlated segments and discarding weakly correlated ones, the visual and audio features can learn global event semantics in videos. Finally, we propose a novel audio-visual contrastive loss that learns similar semantic representations for the visual and audio global features under cosine and L2 similarity constraints. Extensive experiments on the public AVE dataset demonstrate the effectiveness of the proposed CSS-Net. Its localization accuracies of 80.5% and 76.8% in the fully- and weakly-supervised settings, respectively, are the best among the compared state-of-the-art methods.
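The segment-selection idea sketched in the abstract can be illustrated with a toy example. The snippet below is a minimal sketch, not the paper's implementation: it assumes per-segment visual and audio feature matrices, builds a cross-correlation matrix from normalized feature pairs, and keeps the time steps with the highest same-time audio-visual correlation. The function name, `keep_ratio` parameter, and scoring rule are all hypothetical choices for illustration.

```python
import numpy as np

def select_consistent_segments(visual, audio, keep_ratio=0.75):
    """Toy CASM-style selection: keep segments whose visual and audio
    features agree, drop weakly correlated ones.

    visual, audio: (T, D) arrays of per-segment features.
    Returns sorted indices of the retained time steps.
    """
    # L2-normalize each segment feature so dot products behave like
    # correlation coefficients in [-1, 1].
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + 1e-8)
    a = audio / (np.linalg.norm(audio, axis=1, keepdims=True) + 1e-8)
    # Cross-correlation matrix over all time-step pairs (T x T).
    corr = v @ a.T
    # Score each time step by its same-time audio-visual correlation
    # (the diagonal of the cross-correlation matrix).
    scores = np.diag(corr)
    # Keep the top keep_ratio fraction of segments, in temporal order.
    k = max(1, int(round(keep_ratio * len(scores))))
    return np.sort(np.argsort(scores)[-k:])
```

With four segments where the last one has mismatched audio (e.g. ambient noise), the mismatched segment gets the lowest score and is discarded, while the three consistent segments are retained.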
Pages: 701-713
Page count: 13
Related papers
50 records total
  • [41] Self-Supervised Video Representation and Temporally Adaptive Attention for Audio-Visual Event Localization
    Ran, Yue
    Tang, Hongying
    Li, Baoqing
    Wang, Guohui
    APPLIED SCIENCES-BASEL, 2022, 12 (24):
  • [42] Segment-level event perception with semantic dictionary for weakly supervised audio-visual video parsing
    Xie, Zhuyang
    Yang, Yan
    Yu, Yankai
    Wang, Jie
    Liu, Yan
    Jiang, Yongquan
    KNOWLEDGE-BASED SYSTEMS, 2025, 310
  • [43] AUDIO-VISUAL SPEAKER LOCALIZATION VIA WEIGHTED CLUSTERING
    Gebru, Israel D.
    Alameda-Pineda, Xavier
    Horaud, Radu
    Forbes, Florence
    2014 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2014,
  • [44] Tracking atoms with particles for audio-visual source localization
    Monaci, Gianluca
    Vandergheynst, Pierre
    Maggio, Emilio
    Cavallaro, Andrea
    2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL II, PTS 1-3, 2007, : 753 - +
  • [45] Audio-visual speaker localization using graphical models
    Kushal, Akash
    Rahurkar, Mandar
    Li Fei-Fei
    Ponce, Jean
    Huang, Thomas
    18TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2006, : 291 - +
  • [46] Audio-Visual Localization by Synthetic Acoustic Image Generation
    Sanguineti, Valentina
    Morerio, Pietro
    Del Bue, Alessio
    Murino, Vittorio
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 2523 - 2531
  • [47] Distributed audio-visual archives network (DiVAN)
    Tirakis, A
    Katalagarianos, P
    Papathomas, M
    Hamilakis, C
    IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA COMPUTING AND SYSTEMS, PROCEEDINGS VOL 2, 1999, : 1086 - 1088
  • [48] Integrated audio-visual processing for object localization and tracking
    Pingali, GS
    MULTIMEDIA COMPUTING AND NETWORKING 1998, 1997, 3310 : 206 - 213
  • [49] Omnidirectional audio-visual talker localization based on dynamic fusion of audio-visual features using validity and reliability criteria
    Denda, Yuki
    Nishiura, Takanobu
    Yamashita, Yoichi
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2008, E91D (03): : 598 - 606
  • [50] Audio-Visual Event Classification via Spatial-Temporal-Audio Words
    Cao, Yu
    Baang, Sung
    Liu, Shih-Hsi 'Alex'
    Li, Ming
    Hu, Sanqing
    19TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOLS 1-6, 2008, : 858 - +