One-class network leveraging spectro-temporal features for generalized synthetic speech detection

Cited by: 0

Authors:

Ye, Jiahong [1]

Yan, Diqun [1,2]

Fu, Songyin [1]

Ma, Bin [3]

Xia, Zhihua [4]

Affiliations:

[1] Ningbo Univ, Fac Elect Engn & Comp Sci, Ningbo, Peoples R China

[2] Ningbo Univ Finance & Econ, Coll Digital Technol & Engn, Ningbo, Peoples R China

[3] Qilu Univ Technol, Shandong Acad Sci, Key Lab Comp Power Network & Informat Secur, Minist Educ, Jinan, Peoples R China

[4] Jinan Univ, Coll Cyber Secur, Guangzhou, Peoples R China

Source:

SPEECH COMMUNICATION | 2025 / Vol. 169

Keywords:

ASVspoof; One-class learning; Spectro-Temporal; Speech anti-spoofing

DOI:

10.1016/j.specom.2025.103200

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Synthetic speech attacks pose significant threats to Automatic Speaker Verification (ASV) systems. To counter these, various detection systems have been developed. However, these models often struggle with reduced accuracy when encountering novel spoofing attacks during testing. To address this issue, this paper proposes a One-Class Network architecture that leverages features extracted from the log power spectrum of the F0 subband. We have developed an advanced spectro-temporal enhancement module, comprising the Temporal Correlation Integrate Module (TCIM) and the Frequency-Adaptive Dependency Module (FADM), to accurately capture F0 subband details. TCIM captures crucial temporal dynamics and models the long-term dependencies characteristic of the F0 signals within the F0 subband. Meanwhile, FADM employs a frequency-adaptive mechanism to identify critical frequency bands, allowing the detection system to conduct a thorough and detailed analysis. Additionally, we introduce a KLOC-Softmax loss function that incorporates the KoLeo regularizer. This function promotes a uniform distribution of features within batches, effectively addressing intra-class imbalance and aiding balanced optimization. Experimental results on the ASVspoof 2019 LA dataset show that our approach achieves an equal error rate (EER) of 0.38% and a minimum tandem detection cost function (min t-DCF) of 0.0127. Our method outperforms most state-of-the-art speech anti-spoofing techniques and demonstrates robust generalizability to previously unseen types of synthetic speech attacks.
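
The KoLeo regularizer mentioned in the abstract can be illustrated with a minimal sketch. It uses the commonly cited form of the regularizer, the negative mean log distance from each feature to its nearest neighbour in the batch, so spreading features apart lowers the loss. The abstract does not say how KLOC-Softmax combines this term with the one-class softmax objective, so only the regularizer is shown. The function names, the L2 normalization step, and the toy batches are assumptions made for illustration, not details from the paper.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def koleo_regularizer(batch, eps=1e-8):
    """Return -(1/n) * sum_i log(d_i), where d_i is the distance from
    feature i to its nearest neighbour in the batch.

    Assumption: features are L2-normalized first, so the term measures
    how evenly they are spread over the unit sphere.
    """
    feats = [l2_normalize(v) for v in batch]
    n = len(feats)
    if n < 2:
        raise ValueError("KoLeo needs at least two feature vectors")
    total = 0.0
    for i in range(n):
        # Distance to the closest other feature in the same batch.
        nearest = min(
            math.dist(feats[i], feats[j]) for j in range(n) if j != i
        )
        total += math.log(nearest + eps)
    return -total / n


# Toy batches (hypothetical data): one tightly clustered, one evenly spread.
clustered = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.97, 0.15]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

loss_clustered = koleo_regularizer(clustered)
loss_spread = koleo_regularizer(spread)
print(loss_clustered > loss_spread)
```

In this toy case the evenly spread batch receives the lower penalty, which matches the abstract's description of the regularizer promoting a uniform distribution of features within a batch.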


Pages: 10
