Synthetic Speech Detection Based on the Temporal Consistency of Speaker Features

Cited by: 1
Authors
Zhang, Yuxiang [1 ,2 ]
Li, Zhuo [1 ,2 ]
Lu, Jingze [1 ,2 ]
Wang, Wenchao [1 ,2 ]
Zhang, Pengyuan [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
Keywords
Feature extraction; Speech synthesis; Signal processing algorithms; Training; Robustness; Partitioning algorithms; Task analysis; Anti-spoofing; interpretability; pre-trained system; robustness; speaker verification; VERIFICATION;
DOI
10.1109/LSP.2024.3381890
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline Codes
0808 ; 0809 ;
Abstract
Current synthetic speech detection (SSD) methods perform well on specific datasets but still need better interpretability and robustness. One possible reason is the lack of interpretability analysis of synthetic speech defects. In this paper, the flaws in the temporal consistency (TC) of speaker features inherent in the speech synthesis process are analyzed. Differences in the TC of intra-utterance speaker features arise from the limited control over speaker features during speech synthesis: speech generated by text-to-speech algorithms exhibits higher TC, while speech generated by voice conversion algorithms exhibits slightly lower TC, compared to bona fide speech. Based on this finding, a new SSD method based on the TC of speaker features is proposed: modeling the TC of intra-utterance speaker features extracted by a pre-trained automatic speaker verification (ASV) system. The proposed method achieves equal error rates of 0.84%, 3.93%, 12.98% and 24.66% on the ASVspoof 2019 LA, 2021 LA, 2021 DF and In-the-Wild evaluation datasets, respectively, demonstrating strong interpretability and robustness.
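The abstract's core idea, that speaker identity drifts differently over time in synthetic versus bona fide speech, can be illustrated with a toy measure. The sketch below scores temporal consistency as the mean cosine similarity between consecutive segment-level speaker embeddings; the function name and this particular similarity-based formulation are illustrative assumptions, not the paper's exact model, which learns TC patterns rather than thresholding a single statistic.

```python
import numpy as np

def temporal_consistency(embeddings):
    """Mean cosine similarity between consecutive segment embeddings.

    `embeddings` is an (n_segments, dim) array of speaker embeddings,
    e.g. one per short window of an utterance from a pre-trained ASV
    model (hypothetical input here). Higher values indicate a more
    stable speaker identity across the utterance.
    """
    emb = np.asarray(embeddings, dtype=float)
    # L2-normalize each segment embedding
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    # cosine similarity between each pair of consecutive segments
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    return float(np.mean(sims))

# identical segments -> perfect consistency (score 1.0);
# orthogonal segments -> score 0.0
print(temporal_consistency([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]))
print(temporal_consistency([[1.0, 0.0], [0.0, 1.0]]))
```

Under the abstract's finding, such a score would tend to run high for text-to-speech output and slightly low for voice-converted speech relative to bona fide recordings.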
Pages: 944 - 948 (5 pages)