Learning weakly supervised audio-visual violence detection in hyperbolic space

Cited by: 0

Authors:

Zhou, Xiao [1]

Peng, Xiaogang [1]

Wen, Hao [2]

Luo, Yikai [1]

Yu, Keyang [1]

Yang, Ping [1]

Wu, Zizhao [1]

Affiliations:

[1] Hangzhou Dianzi Univ, Sch Digital Media & Technol, Hangzhou, Peoples R China

[2] Natl Univ Def Technol, Coll Elect Sci & Technol, Changsha, Peoples R China

Source:

IMAGE AND VISION COMPUTING | 2024 / Volume 151

Keywords:

Weakly supervised learning; Hyperbolic space; Video violence detection

DOI:

10.1016/j.imavis.2024.105286

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In recent years, the task of weakly supervised audio-visual violence detection has gained considerable attention. The goal of this task is to identify violent segments within multimodal data based on video-level labels. Despite advances in this field, the traditional Euclidean neural networks used in prior research have difficulty capturing highly discriminative representations because of the limitations of the feature space. To overcome this, we propose HyperVD, a novel framework that learns snippet embeddings in hyperbolic space to improve model discrimination. We contribute two branches of fully hyperbolic graph convolutional networks that excavate feature similarities and temporal relationships among snippets in hyperbolic space. By learning snippet representations in this space, the framework effectively learns semantic discrepancies between violent snippets and normal ones. Extensive experiments on the XD-Violence benchmark demonstrate that our method achieves 85.67% AP, outperforming the state-of-the-art methods by a sizable margin.
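As a rough illustration of what embedding snippets in hyperbolic space can involve, the sketch below maps Euclidean feature vectors onto a Poincaré ball and measures the geodesic distance between them. This is not the HyperVD implementation: the abstract does not say which model of hyperbolic space is used, so the Poincaré ball, the curvature parameter `c`, the function names `expmap0` and `poincare_distance`, and the two-dimensional example features are all assumptions made for illustration.

```python
import math


def expmap0(v, c=1.0):
    """Exponential map at the origin: sends a Euclidean (tangent) vector
    onto the Poincare ball of curvature -c. Hypothetical helper name."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    def sq_norm(u):
        return sum(a * a for a in u)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    arg = 1.0 + 2.0 * c * diff / denom
    # Clamp to guard against rounding slightly below 1.0.
    return math.acosh(max(1.0, arg)) / math.sqrt(c)


# Made-up 2-D "snippet features", used only to exercise the functions.
normal_snippet = expmap0([0.3, 0.4])
violent_snippet = expmap0([-1.5, 2.0])
origin = [0.0, 0.0]

print(poincare_distance(origin, normal_snippet))
print(poincare_distance(normal_snippet, violent_snippet))
```

Under this convention the distance from the origin to `expmap0(v)` is twice the Euclidean norm of `v`, and distances grow rapidly as points approach the boundary of the ball, which is the property usually cited as making hyperbolic embeddings more discriminative than Euclidean ones.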


Pages: 10
