One-class network leveraging spectro-temporal features for generalized synthetic speech detection

Cited by: 0

Authors:

Ye, Jiahong [1]

Yan, Diqun [1,2]

Fu, Songyin [1]

Ma, Bin [3]

Xia, Zhihua [4]

Affiliations:

[1] Ningbo Univ, Fac Elect Engn & Comp Sci, Ningbo, Peoples R China

[2] Ningbo Univ Finance & Econ, Coll Digital Technol & Engn, Ningbo, Peoples R China

[3] Qilu Univ Technol, Shandong Acad Sci, Key Lab Comp Power Network & Informat Secur, Minist Educ, Jinan, Peoples R China

[4] Jinan Univ, Coll Cyber Secur, Guangzhou, Peoples R China

Source:

SPEECH COMMUNICATION | 2025 / Vol. 169

Keywords:

ASVspoof; One-class learning; Spectro-Temporal; Speech anti-spoofing

DOI:

10.1016/j.specom.2025.103200

Chinese Library Classification (CLC):

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Synthetic speech attacks pose significant threats to Automatic Speaker Verification (ASV) systems. To counter these, various detection systems have been developed. However, these models often struggle with reduced accuracy when encountering novel spoofing attacks during testing. To address this issue, this paper proposes a One-Class Network architecture that leverages features extracted from the log power spectrum of the F0 subband. We have developed an advanced spectro-temporal enhancement module, comprising the Temporal Correlation Integrate Module (TCIM) and the Frequency-Adaptive Dependency Module (FADM), to accurately capture F0 subband details. TCIM captures crucial temporal dynamics and models the long-term dependencies characteristic of the F0 signals within the F0 subband. Meanwhile, FADM employs a frequency-adaptive mechanism to identify critical frequency bands, allowing the detection system to conduct a thorough and detailed analysis. Additionally, we introduce a KLOC-Softmax loss function that incorporates the KoLeo regularizer. This function promotes a uniform distribution of features within batches, effectively addressing intra-class imbalance and aiding balanced optimization. Experimental results on the ASVspoof 2019 LA dataset show that our approach achieves an equal error rate (EER) of 0.38% and a minimum tandem detection cost function (min t-DCF) of 0.0127. Our method outperforms most state-of-the-art speech anti-spoofing techniques and demonstrates robust generalizability to previously unseen types of synthetic speech attacks.
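The abstract names the KoLeo regularizer as the component that pushes features toward a uniform spread within a batch. As a rough illustration of that idea only, the sketch below computes the standard KoLeo term, the negative mean log distance from each feature vector to its nearest neighbour in the batch. It is not the authors' KLOC-Softmax implementation: how the term is weighted and combined with the one-class softmax loss is not stated in this record, and the function name `koleo_regularizer`, the `eps` constant, and the toy batches are assumptions made for the example.

```python
import math


def koleo_regularizer(features, eps=1e-8):
    """Return -(1/n) * sum_i log(min_{j != i} ||x_i - x_j||) for one batch.

    `features` is a list of equal-length vectors (hypothetical embeddings).
    `eps` is an assumed constant that keeps the log finite for duplicates.
    """
    n = len(features)
    if n < 2:
        raise ValueError("need at least two feature vectors")
    total = 0.0
    for i, x in enumerate(features):
        # Euclidean distance from x to its nearest neighbour, excluding itself.
        nearest = min(
            math.dist(x, y) for j, y in enumerate(features) if j != i
        )
        total += math.log(nearest + eps)
    return -total / n


# Toy batches (made up for illustration): same layout, different spacing.
clustered = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
spread = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

if __name__ == "__main__":
    print("clustered:", koleo_regularizer(clustered))
    print("spread:   ", koleo_regularizer(spread))
```

The tightly clustered batch gets the larger penalty, so minimizing this term alongside a classification loss would discourage embeddings from collapsing onto each other. In a training setup the same computation would normally be written with a tensor library so that gradients flow through it.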

Pages: 10
