One-class network leveraging spectro-temporal features for generalized synthetic speech detection

Cited by: 0
Authors
Ye, Jiahong [1]
Yan, Diqun [1,2]
Fu, Songyin [1]
Ma, Bin [3]
Xia, Zhihua [4]
Affiliations
[1] Ningbo Univ, Fac Elect Engn & Comp Sci, Ningbo, Peoples R China
[2] Ningbo Univ Finance & Econ, Coll Digital Technol & Engn, Ningbo, Peoples R China
[3] Qilu Univ Technol, Shandong Acad Sci, Key Lab Comp Power Network & Informat Secur, Minist Educ, Jinan, Peoples R China
[4] Jinan Univ, Coll Cyber Secur, Guangzhou, Peoples R China
Keywords
ASVspoof; One-class learning; Spectro-temporal; Speech anti-spoofing
DOI
10.1016/j.specom.2025.103200
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Synthetic speech attacks pose significant threats to Automatic Speaker Verification (ASV) systems. To counter these, various detection systems have been developed. However, these models often struggle with reduced accuracy when encountering novel spoofing attacks during testing. To address this issue, this paper proposes a One-Class Network architecture that leverages features extracted from the log power spectrum of the F0 subband. We have developed an advanced spectro-temporal enhancement module, comprising the Temporal Correlation Integrate Module (TCIM) and the Frequency-Adaptive Dependency Module (FADM), to accurately capture F0 subband details. TCIM captures crucial temporal dynamics and models the long-term dependencies characteristic of the F0 signals within the F0 subband. Meanwhile, FADM employs a frequency-adaptive mechanism to identify critical frequency bands, allowing the detection system to conduct a thorough and detailed analysis. Additionally, we introduce a KLOC-Softmax loss function that incorporates the KoLeo regularizer. This function promotes a uniform distribution of features within batches, effectively addressing intra-class imbalance and aiding balanced optimization. Experimental results on the ASVspoof 2019 LA dataset show that our approach achieves an equal error rate (EER) of 0.38% and a minimum tandem detection cost function (min t-DCF) of 0.0127. Our method outperforms most state-of-the-art speech anti-spoofing techniques and demonstrates robust generalizability to previously unseen types of synthetic speech attacks.
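The abstract says the KoLeo regularizer promotes a uniform spread of features within a batch. The record does not give the formula or how the term is combined with the one-class softmax loss, so the following is only a minimal sketch of the KoLeo term as it is commonly defined (the negative mean log nearest-neighbour distance). The function name `koleo_regularizer`, the L2 normalization step, and the `eps` parameter are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def koleo_regularizer(features, eps=1e-8):
    """Sketch of a KoLeo term: -(1/n) * sum_i log(min_{j != i} ||x_i - x_j||).

    Assumption: embeddings are L2-normalized first; the source does not say.
    """
    x = np.asarray(features, dtype=np.float64)
    # Normalize each embedding to (approximately) unit length.
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    # Pairwise Euclidean distances, shape (n, n).
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Exclude each point's zero distance to itself.
    np.fill_diagonal(dist, np.inf)
    nearest = dist.min(axis=1)
    # Small nearest-neighbour distances (crowded features) give a large penalty.
    return float(-np.mean(np.log(nearest + eps)))


# Hypothetical 2-D batches: one evenly spread on the unit circle, one crowded.
angles_spread = np.array([0.0, 0.5, 1.0, 1.5]) * np.pi
angles_clustered = np.array([0.0, 0.01, 0.02, 0.03])
spread = np.stack([np.cos(angles_spread), np.sin(angles_spread)], axis=1)
clustered = np.stack([np.cos(angles_clustered), np.sin(angles_clustered)], axis=1)

print(koleo_regularizer(spread), koleo_regularizer(clustered))
```

In this toy comparison the crowded batch receives the larger penalty, which is the behaviour the abstract attributes to the regularizer: minimizing the term pushes each feature away from its nearest neighbour in the batch.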
Pages: 10