Audio Multi-View Spoofing Detection Framework Based on Audio-Text-Emotion Correlations

Cited by: 0
Authors
Wu, Junyan [1]
Yin, Qilin [1]
Sheng, Ziqi [1]
Lu, Wei [1]
Huang, Jiwu [2]
Li, Bin [3, 4]
Affiliations
[1] Sun Yat-sen Univ, Sch Comp Sci & Engn, Guangdong Prov Key Lab Informat Secur Technol, Minist Educ, Key Lab Informat Technol, Guangzhou 510006, Peoples R China
[2] Shenzhen MSU-BIT Univ, Fac Engn, Guangdong Lab Machine Percept & Intelligent Comp, Shenzhen 518116, Peoples R China
[3] Shenzhen Univ, Guangdong Key Lab Intelligent Informat Proc, Shenzhen 518060, Peoples R China
[4] Shenzhen Univ, Shenzhen Key Lab Media Secur, Shenzhen 518060, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multimedia forensics; audio spoofing detection; multi-view learning; graph attention mechanism; automatic speaker verification; speech
DOI
10.1109/TIFS.2024.3431888
Chinese Library Classification
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
In recent years, audio spoofing detection has received widespread attention for protecting personal privacy and social security. Despite the significant progress achieved in audio single-view spoofing detection, challenges remain with regard to addressing unknown spoofing attacks in realistic scenarios. To solve these challenging problems, in this paper, we introduce a novel audio multi-view spoofing detection framework (AMSDF), whose goal is to capture both intra-view and inter-view cues by measuring correlations within audio multi-view features (i.e., audio-emotion-text) for audio spoofing detection. In general, different view features are inherently interconnected in the real patterns, while they may present unnatural correlations in the spoofing patterns. Therefore, more discriminative cues can be mined by utilizing their complex interactions, which is beneficial to the audio spoofing detection task. To this end, an intra-view graph attention mechanism (IGAM) is first utilized to aggregate each intra-view node within the same view. Subsequently, a heterogeneous graph fusion module (HGFM) is applied to measure correlations within inter-view nodes, which are enhanced with a master node for comprehensive analysis purposes. Finally, a group-based readout scheme (GRS) is designed to capture and preserve the most distinctive cues by leveraging the strengths of different feature sets, thereby effectively distinguishing subtle differences between real and spoofing audio. The experimental results show that our proposed framework can achieve better performance than that of the state-of-the-art methods, especially in realistic scenarios. The code and pre-trained models are available at https://github.com/ItzJuny/AMSDF.
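The three stages named in the abstract (attention within each view, fusion across views with a master node, and a group-wise readout) can be illustrated with a toy NumPy sketch. This is not the authors' implementation, which is available at the repository linked above. Everything here is an assumption made for illustration: the function names, the use of plain dot-product self-attention in place of the paper's graph attention, the mean-initialized master node, the max-plus-mean pooling, and the random toy features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_aggregate(nodes):
    # Hypothetical stand-in for graph attention: dot-product
    # self-attention over a fully connected set of nodes, (n, d) -> (n, d).
    scores = nodes @ nodes.T / np.sqrt(nodes.shape[1])
    weights = softmax(scores, axis=-1)
    return weights @ nodes

def fuse_views(views):
    # Stack the nodes of all views, prepend a master node (assumed here
    # to start as the mean of all nodes), and attend across every node.
    stacked = np.concatenate(views, axis=0)
    master = stacked.mean(axis=0, keepdims=True)
    fused = attention_aggregate(np.concatenate([master, stacked], axis=0))
    return fused[0], fused[1:]

def group_readout(master, nodes, sizes):
    # Max- and mean-pool each view's group of nodes separately,
    # then append the master node to form one embedding.
    parts, start = [], 0
    for n in sizes:
        group = nodes[start:start + n]
        parts += [group.max(axis=0), group.mean(axis=0)]
        start += n
    return np.concatenate(parts + [master])

rng = np.random.default_rng(0)
d = 8
# Toy node features for the audio, text, and emotion views.
views = [rng.normal(size=(n, d)) for n in (5, 4, 3)]

intra = [attention_aggregate(v) for v in views]          # intra-view step
master, fused = fuse_views(intra)                        # inter-view step
embedding = group_readout(master, fused, [v.shape[0] for v in intra])
print(embedding.shape)  # (56,)
```

The resulting vector has 3 views x 2 pooled statistics x 8 dimensions plus the 8-dimensional master node, and would feed a binary real-versus-spoof classifier in a full system.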
Pages: 7133-7146
Page count: 14