One-class network leveraging spectro-temporal features for generalized synthetic speech detection

Citations: 0
Authors
Ye, Jiahong [1]
Yan, Diqun [1,2]
Fu, Songyin [1]
Ma, Bin [3]
Xia, Zhihua [4]
Affiliations
[1] Ningbo Univ, Fac Elect Engn & Comp Sci, Ningbo, Peoples R China
[2] Ningbo Univ Finance & Econ, Coll Digital Technol & Engn, Ningbo, Peoples R China
[3] Qilu Univ Technol, Shandong Acad Sci, Key Lab Comp Power Network & Informat Secur, Minist Educ, Jinan, Peoples R China
[4] Jinan Univ, Coll Cyber Secur, Guangzhou, Peoples R China
Keywords
ASVspoof; One-class learning; Spectro-Temporal; Speech anti-spoofing;
DOI
10.1016/j.specom.2025.103200
Chinese Library Classification
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Synthetic speech attacks pose significant threats to Automatic Speaker Verification (ASV) systems. To counter these, various detection systems have been developed. However, these models often struggle with reduced accuracy when encountering novel spoofing attacks during testing. To address this issue, this paper proposes a One-Class Network architecture that leverages features extracted from the log power spectrum of the F0 subband. We have developed an advanced spectro-temporal enhancement module, comprising the Temporal Correlation Integrate Module (TCIM) and the Frequency-Adaptive Dependency Module (FADM), to accurately capture F0 subband details. TCIM captures crucial temporal dynamics and models the long-term dependencies characteristic of the F0 signals within the F0 subband. Meanwhile, FADM employs a frequency-adaptive mechanism to identify critical frequency bands, allowing the detection system to conduct a thorough and detailed analysis. Additionally, we introduce a KLOC-Softmax loss function that incorporates the KoLeo regularizer. This function promotes a uniform distribution of features within batches, effectively addressing intra-class imbalance and aiding balanced optimization. Experimental results on the ASVspoof 2019 LA dataset show that our approach achieves an equal error rate (EER) of 0.38% and a minimum tandem detection cost function (min t-DCF) of 0.0127. Our method outperforms most state-of-the-art speech anti-spoofing techniques and demonstrates robust generalizability to previously unseen types of synthetic speech attacks.
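The abstract says the KoLeo regularizer promotes a uniform distribution of features within a batch. A minimal sketch of that term alone, assuming the standard Kozachenko-Leonenko form (the negative mean log distance from each feature to its nearest neighbour in the batch) applied to L2-normalized features, is given below. The function names and the toy batches are hypothetical, and the abstract does not state how the paper weights or combines this term with its one-class softmax loss, so that part is not shown.

```python
import math


def l2_normalize(v):
    # Project a feature vector onto the unit sphere (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def koleo_regularizer(batch, eps=1e-8):
    """Negative mean log nearest-neighbour distance within a batch.

    Lower values mean the features are more evenly spread out.
    Assumes the batch holds at least two feature vectors.
    """
    points = [l2_normalize(v) for v in batch]
    n = len(points)
    total = 0.0
    for i in range(n):
        # Distance from point i to its closest other point in the batch.
        nearest = min(
            math.dist(points[i], points[j]) for j in range(n) if j != i
        )
        total += math.log(nearest + eps)
    return -total / n


# Hypothetical toy batches of 2-D features.
clustered = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1], [0.97, 0.15]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

# Tightly clustered features are penalized more than evenly spread ones.
print(koleo_regularizer(clustered) > koleo_regularizer(spread))  # True
```

Minimizing this term pushes each feature away from its nearest neighbour, which is one way to read the abstract's claim that the regularizer counteracts intra-class imbalance.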
Pages: 10