Segment-level event perception with semantic dictionary for weakly supervised audio-visual video parsing

Authors
Xie, Zhuyang [1, 2]
Yang, Yan [1, 2, 3]
Yu, Yankai [1, 2]
Wang, Jie [1, 2]
Liu, Yan [1, 2]
Jiang, Yongquan [1, 2, 3]
Affiliations
[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu 611756, Peoples R China
[2] Minist Educ, Engn Res Ctr Sustainable Urban Intelligent Transpo, Beijing, Peoples R China
[3] Natl Engn Lab Integrated Transportat Big Data Appl, Chengdu 611756, Peoples R China
Keywords
Audio-visual video parsing; Weakly supervised; Multimodal learning; Pseudo label; ACTION RECOGNITION;
DOI
10.1016/j.knosys.2024.112884
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Videos capture auditory and visual signals, each conveying distinct events. Simultaneously analyzing these multimodal signals enhances human comprehension of the video content. We focus on the audio-visual video parsing task, in which we integrate auditory and visual cues to identify events in each modality and pinpoint their temporal boundaries. Since fine-grained segment-level annotation is labor-intensive and time-consuming, only video-level labels are provided during the training phase. Labels and timestamps for each modality are unknown. A prevalent strategy is to aggregate audio and visual features through cross-modal attention and further denoise video labels to parse events within video segments in a weakly supervised manner. However, these denoised labels have limitations: they are restricted to the video level, and segment-level annotations remain unknown. In this paper, we propose a semantic dictionary description method for audio-visual video parsing, termed SDDP (Semantic Dictionary Description for video Parsing), which uses a semantic dictionary to explicitly represent the content of video segments. In particular, we query the relevance of each segment with semantic words from the dictionary and determine the pertinent semantic words to redescribe each segment. These redescribed segments encode event-related information, facilitating cross-modal video parsing. Furthermore, a pseudo label generation strategy is introduced to convert the relevance of semantic dictionary queries into segment-level pseudo labels, which provide segment-level event information to supervise event prediction. Extensive experiments demonstrate the effectiveness of the proposed method, achieving superior performance compared with state-of-the-art methods.
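The abstract's two core steps (querying a semantic dictionary to redescribe each segment, then converting query relevance into segment-level pseudo labels) can be illustrated with a minimal sketch. This is not the authors' SDDP implementation; the abstract gives no formulas, so everything here is an assumption made for illustration: one dictionary word per event class, dot-product similarity with a softmax as the relevance score, a relevance-weighted sum as the redescription, and a fixed threshold masked by the video-level labels for pseudo labels. The names `redescribe`, `pseudo_labels`, and `threshold` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def redescribe(segments, dictionary):
    """Query every segment against every dictionary word (assumed scheme).

    Returns (relevance, redescribed), where relevance[t][k] is the
    softmax-normalised similarity of segment t to word k, and
    redescribed[t] is the relevance-weighted sum of word embeddings.
    """
    dim = len(dictionary[0])
    relevance, redescribed = [], []
    for seg in segments:
        weights = softmax([dot(seg, word) for word in dictionary])
        relevance.append(weights)
        redescribed.append([
            sum(w * word[d] for w, word in zip(weights, dictionary))
            for d in range(dim)
        ])
    return relevance, redescribed


def pseudo_labels(relevance, video_labels, threshold=0.5):
    """Turn relevance into segment-level pseudo labels (assumed rule).

    A segment is marked positive for class k only if the class appears in
    the video-level labels and its relevance reaches the threshold.
    """
    return [
        [1 if video_labels[k] == 1 and r[k] >= threshold else 0
         for k in range(len(r))]
        for r in relevance
    ]


# Toy data: three segments, two event classes, only class 0 labelled
# at the video level.
segments = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
dictionary = [[1.0, 0.0], [0.0, 1.0]]
video_labels = [1, 0]

relevance, redescribed = redescribe(segments, dictionary)
labels = pseudo_labels(relevance, video_labels)
print(labels)
```

In this toy case the second segment resembles class 1, which the video-level labels exclude, so it receives no positive pseudo label; the first and third segments are marked positive for class 0 only.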
Pages: 14