Audio Multi-View Spoofing Detection Framework Based on Audio-Text-Emotion Correlations

Cited by: 0
Authors
Wu, Junyan [1]
Yin, Qilin [1]
Sheng, Ziqi [1]
Lu, Wei [1]
Huang, Jiwu [2]
Li, Bin [3, 4]
Affiliations
[1] Sun Yat-sen Univ, Sch Comp Sci & Engn, Guangdong Prov Key Lab Informat Secur Technol, Minist Educ, Key Lab Informat Technol, Guangzhou 510006, Peoples R China
[2] Shenzhen MSU BIT Univ, Fac Engn, Guangdong Lab Machine Percept & Intelligent Comp, Shenzhen 518116, Peoples R China
[3] Shenzhen Univ, Guangdong Key Lab Intelligent Informat Proc, Shenzhen 518060, Peoples R China
[4] Shenzhen Univ, Shenzhen Key Lab Media Secur, Shenzhen 518060, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multimedia forensics; audio spoofing detection; multi-view learning; graph attention mechanism; AUTOMATIC SPEAKER VERIFICATION; SPEECH;
DOI
10.1109/TIFS.2024.3431888
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
In recent years, audio spoofing detection has received widespread attention for protecting personal privacy and social security. Despite the significant progress achieved in audio single-view spoofing detection, challenges remain with regard to addressing unknown spoofing attacks in realistic scenarios. To solve these challenging problems, in this paper, we introduce a novel audio multi-view spoofing detection framework (AMSDF), whose goal is to capture both intra-view and inter-view cues by measuring correlations within audio multi-view features (i.e., audio-emotion-text) for audio spoofing detection. In general, different view features are inherently interconnected in the real patterns, while they may present unnatural correlations in the spoofing patterns. Therefore, more discriminative cues can be mined by utilizing their complex interactions, which is beneficial to the audio spoofing detection task. To this end, an intra-view graph attention mechanism (IGAM) is first utilized to aggregate each intra-view node within the same view. Subsequently, a heterogeneous graph fusion module (HGFM) is applied to measure correlations within inter-view nodes, which are enhanced with a master node for comprehensive analysis purposes. Finally, a group-based readout scheme (GRS) is designed to capture and preserve the most distinctive cues by leveraging the strengths of different feature sets, thereby effectively distinguishing subtle differences between real and spoofing audio. The experimental results show that our proposed framework can achieve better performance than that of the state-of-the-art methods, especially in realistic scenarios. The code and pre-trained models are available at https://github.com/ItzJuny/AMSDF.
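The three stages named in the abstract (intra-view attention, inter-view fusion with a master node, and a group-based readout) can be illustrated with a toy sketch. This is not the authors' implementation, which is available in the repository linked above. All function names, the random projection weights (stand-ins for learned parameters), the mean-initialised master node, and the max/mean readout are assumptions chosen only to make the data flow concrete.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_weights(nodes, weight):
    """Row-stochastic attention matrix over a fully connected graph."""
    projected = nodes @ weight
    scores = projected @ projected.T / np.sqrt(projected.shape[1])
    return softmax(scores, axis=1)


def graph_attention(nodes, weight):
    """Update each node from all nodes in the same graph: (n, d) -> (n, d)."""
    return np.tanh(attention_weights(nodes, weight) @ (nodes @ weight))


def multi_view_readout(views, seed=0):
    """Toy pipeline over a list of (n_i, d) node matrices, one per view."""
    rng = np.random.default_rng(seed)
    dim = views[0].shape[1]

    def random_weight():
        # Assumption: random weights replace trained parameters.
        return rng.standard_normal((dim, dim)) / np.sqrt(dim)

    # Stage 1 (intra-view): attention among nodes of the same view only.
    intra = [graph_attention(view, random_weight()) for view in views]

    # Stage 2 (inter-view): attention across all views plus one master node.
    stacked = np.concatenate(intra, axis=0)
    master = stacked.mean(axis=0, keepdims=True)  # assumed initialisation
    fused = graph_attention(np.concatenate([stacked, master], axis=0),
                            random_weight())

    # Stage 3 (readout): max and mean per view group, then the master node.
    parts, start = [], 0
    for view in views:
        group = fused[start:start + len(view)]
        parts.extend([group.max(axis=0), group.mean(axis=0)])
        start += len(view)
    parts.append(fused[-1])
    return np.concatenate(parts)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical audio, emotion, and text views with 5, 4, and 3 nodes.
    views = [rng.standard_normal((n, 8)) for n in (5, 4, 3)]
    print(multi_view_readout(views).shape)  # (56,)
```

With three views and 8-dimensional nodes, the readout vector has 3 x 2 x 8 + 8 = 56 entries. In a real detector, a vector of this kind would feed a classifier that separates real from spoofed audio.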
Pages: 7133-7146
Page count: 14
Related papers
50 items in total
  • [31] Hybrid Approach for Emotion Classification of Audio Conversation Based on Text and Speech Mining
    Bhaskar, Jasmine
    Sruthi, K.
    Nedungadi, Prema
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGIES, ICICT 2014, 2015, 46 : 635 - 643
  • [32] Apk2Audio4AndMal: Audio Based Malware Family Detection Framework
    Kural, Oguz Emre
    Kilic, Erdal
    Aksac, Ceyda
    IEEE ACCESS, 2023, 11 : 27527 - 27535
  • [33] Multi-modal emotion recognition in conversation based on prompt learning with text-audio fusion features
    Yuezhou Wu
    Siling Zhang
    Pengfei Li
    Scientific Reports, 15 (1)
  • [34] Detection of Hot Topics Using Multi-view Text Clustering
    Fraj, Maha
    Ben Hajkacem, Mohamed Aymen
    Essoussi, Nadia
    INFORMATION INTEGRATION AND WEB INTELLIGENCE, IIWAS 2022, 2022, 13635 : 548 - 558
  • [35] Acoustic features analysis for explainable machine learning-based audio spoofing detection
    Bisogni, Carmen
    Loia, Vincenzo
    Nappi, Michele
    Pero, Chiara
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 249
  • [36] An HMM-Based Multi-view Co-training Framework for Single-View Text Corpora
    Lorenzo Iglesias, Eva
    Seara Vieira, Adrian
    Borrajo Diz, Lourdes
    HYBRID ARTIFICIAL INTELLIGENT SYSTEMS, 2016, 9648 : 66 - 78
  • [37] Delaunay triangulation based text detection from multi-view images of natural scene
    Roy, Soumyadip
    Shivakumara, Palaiahnakote
    Pal, Umapada
    Lu, Tong
    Kumar, Govindaraj Hemantha
    PATTERN RECOGNITION LETTERS, 2020, 129 (129) : 92 - 100
  • [38] Impact of autoencoder based compact representation on emotion detection from audio
    Patel, Nivedita
    Patel, Shireen
    Mankad, Sapan H.
    JOURNAL OF AMBIENT INTELLIGENCE AND HUMANIZED COMPUTING, 2022, 13 (02) : 867 - 885
  • [39] Impact of autoencoder based compact representation on emotion detection from audio
    Nivedita Patel
    Shireen Patel
    Sapan H. Mankad
    Journal of Ambient Intelligence and Humanized Computing, 2022, 13 : 867 - 885
  • [40] Implementation of Multi-modal Speech Emotion Recognition Using Text Data and Audio Signals
    Adesola, Falade
    Adeyinka, Omirinlewo
    Kayode, Akindeji
    Ayodele, Adebiyi
    2023 International Conference on Science, Engineering and Business for Sustainable Development Goals, SEB-SDG 2023, 2023,