Learning weakly supervised audio-visual violence detection in hyperbolic space

Cited by: 0
Authors
Zhou, Xiao [1]
Peng, Xiaogang [1]
Wen, Hao [2]
Luo, Yikai [1]
Yu, Keyang [1]
Yang, Ping [1]
Wu, Zizhao [1]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Digital Media & Technol, Hangzhou, Peoples R China
[2] Natl Univ Def Technol, Coll Elect Sci & Technol, Changsha, Peoples R China
Keywords
Weakly supervised learning; Hyperbolic space; Video violence detection
DOI
10.1016/j.imavis.2024.105286
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, the task of weakly supervised audio-visual violence detection has gained considerable attention. The goal of this task is to identify violent segments within multimodal data based on video-level labels. Despite advances in this field, traditional Euclidean neural networks, which have been used in prior research, encounter difficulties in capturing highly discriminative representations due to limitations of the feature space. To overcome this, we propose HyperVD, a novel framework that learns snippet embeddings in hyperbolic space to improve model discrimination. We contribute two branches of fully hyperbolic graph convolutional networks that excavate feature similarities and temporal relationships among snippets in hyperbolic space. By learning snippet representations in this space, the framework effectively learns semantic discrepancies between violent snippets and normal ones. Extensive experiments on the XD-Violence benchmark demonstrate that our method achieves 85.67% AP, outperforming the state-of-the-art methods by a sizable margin.
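To make the idea of embedding snippets in hyperbolic space more concrete, here is a minimal sketch of how a Euclidean feature vector might be projected into hyperbolic space and how distances are then measured. It assumes the Poincaré ball model with curvature -c; the abstract does not say which hyperbolic model HyperVD uses, so the model choice and the function names `expmap0` and `poincare_distance` are illustrative assumptions, not the authors' implementation.

```python
import math


def expmap0(v, c=1.0):
    """Exponential map at the origin: send a Euclidean (tangent) vector
    into the Poincare ball of curvature -c. (Assumed model, see lead-in.)"""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the ball."""
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    arg = 1.0 + 2.0 * c * sq_diff / ((1.0 - c * sq_x) * (1.0 - c * sq_y))
    return math.acosh(arg) / math.sqrt(c)


# Two hypothetical snippet features (made-up numbers, for illustration only).
snippet_a = expmap0([0.3, 0.4])
snippet_b = expmap0([3.0, 4.0])

# Both points lie inside the unit ball, however large the input norm.
print(math.hypot(*snippet_a) < 1.0, math.hypot(*snippet_b) < 1.0)

# With c = 1, the distance from the origin equals twice the Euclidean norm
# of the original vector, so it keeps growing as points approach the boundary.
print(poincare_distance([0.0, 0.0], snippet_a))
```

Distances in this geometry grow rapidly near the boundary of the ball, which is one common motivation for using hyperbolic embeddings when classes need to be pushed far apart.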
Pages: 10