One-class network leveraging spectro-temporal features for generalized synthetic speech detection

Cited by: 0

Authors:

Ye, Jiahong [1]

Yan, Diqun [1, 2]

Fu, Songyin [1]

Ma, Bin [3]

Xia, Zhihua [4]

Affiliations:

[1] Ningbo Univ, Fac Elect Engn & Comp Sci, Ningbo, Peoples R China

[2] Ningbo Univ Finance & Econ, Coll Digital Technol & Engn, Ningbo, Peoples R China

[3] Qilu Univ Technol, Shandong Acad Sci, Key Lab Comp Power Network & Informat Secur, Minist Educ, Jinan, Peoples R China

[4] Jinan Univ, Coll Cyber Secur, Guangzhou, Peoples R China

Source:

SPEECH COMMUNICATION | 2025 / Vol. 169

Keywords:

ASVspoof; One-class learning; Spectro-Temporal; Speech anti-spoofing

DOI:

10.1016/j.specom.2025.103200

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206; 082403

Abstract:

Synthetic speech attacks pose significant threats to Automatic Speaker Verification (ASV) systems. To counter these, various detection systems have been developed. However, these models often struggle with reduced accuracy when encountering novel spoofing attacks during testing. To address this issue, this paper proposes a One-Class Network architecture that leverages features extracted from the log power spectrum of the F0 subband. We have developed an advanced spectro-temporal enhancement module, comprising the Temporal Correlation Integrate Module (TCIM) and the Frequency-Adaptive Dependency Module (FADM), to accurately capture F0 subband details. TCIM captures crucial temporal dynamics and models the long-term dependencies characteristic of the F0 signals within the F0 subband. Meanwhile, FADM employs a frequency-adaptive mechanism to identify critical frequency bands, allowing the detection system to conduct a thorough and detailed analysis. Additionally, we introduce a KLOC-Softmax loss function that incorporates the KoLeo regularizer. This function promotes a uniform distribution of features within batches, effectively addressing intra-class imbalance and aiding balanced optimization. Experimental results on the ASVspoof 2019 LA dataset show that our approach achieves an equal error rate (EER) of 0.38% and a minimum tandem detection cost function (min t-DCF) of 0.0127. Our method outperforms most state-of-the-art speech anti-spoofing techniques and demonstrates robust generalizability to previously unseen types of synthetic speech attacks.
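The abstract says the KoLeo regularizer promotes a uniform spread of features within a batch. As a rough illustration of that idea only, the sketch below computes the standard Kozachenko-Leonenko (KoLeo) term, which is the negative mean log distance from each feature to its nearest neighbour in the batch. How the paper combines this term with its one-class softmax loss (weighting, feature normalization, which class it is applied to) is not stated in the abstract, so none of that is modelled here. The function name `koleo_regularizer`, the `eps` constant, and the toy batches are assumptions made for this example.

```python
import math


def koleo_regularizer(features, eps=1e-8):
    """KoLeo term: -(1/n) * sum_i log(d_i + eps).

    d_i is the Euclidean distance from feature i to its nearest
    neighbour in the same batch. Tightly clustered batches give a
    large value; well-spread batches give a small one, so minimizing
    the term pushes features apart.
    """
    n = len(features)
    if n < 2:
        raise ValueError("need at least two feature vectors")
    total = 0.0
    for i, x in enumerate(features):
        # Nearest-neighbour distance, excluding the point itself.
        nearest = min(
            math.dist(x, y) for j, y in enumerate(features) if j != i
        )
        total += math.log(nearest + eps)
    return -total / n


# Toy 2-D batches (hypothetical data, for illustration only).
clustered = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (1.0, 1.0)]
spread = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

loss_clustered = koleo_regularizer(clustered)
loss_spread = koleo_regularizer(spread)
```

In this toy case `loss_clustered` is larger than `loss_spread`, which matches the intended behaviour: a batch with several near-duplicate features is penalized more than an evenly spread one.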


Pages: 10

