Audio Spectrogram Transformer for Synthetic Speech Detection via Speech Formant Analysis

Cited by: 4
Authors
Cuccovillo, Luca [1]
Gerhardt, Milica [1]
Aichroth, Patrick [1]
Affiliations
[1] Fraunhofer Inst Digital Media Technol IDMT, Ehrenbergstr 31, Ilmenau, Germany
Keywords
synthetic speech detection; audio deepfakes; spectrogram transformer; voice formants
DOI
10.1109/WIFS58808.2023.10374615
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we address the challenge of synthetic speech detection, which has become increasingly important due to the latest advancements in text-to-speech and voice conversion technologies. We propose a novel multi-task neural network architecture, designed to be interpretable and specifically tailored for audio signals. The architecture includes a feature bottleneck, used to autoencode the input spectrogram, predict the fundamental frequency (f0) trajectory, and classify the speech as synthetic or natural. Hence, the synthesis detection can be considered a byproduct of attending to the energy distribution among vocal formants, providing a clear understanding of which characteristics of the input signal influence the final outcome. Our evaluation on the ASVspoof 2019 LA partition indicates better performance than the current state of the art, with an AUC score of 0.900.
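The AUC figure quoted in the abstract can be read as the probability that a randomly chosen synthetic utterance receives a higher detector score than a randomly chosen natural one. The sketch below illustrates that definition with the standard pairwise (Mann-Whitney) formulation. It is not the authors' evaluation code: the function name `roc_auc`, the label convention (1 = synthetic, 0 = natural), and the toy scores are assumptions made purely for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 1 for synthetic (positive), 0 for natural (negative); assumed convention.
    scores: detector outputs, where higher means "more likely synthetic".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    # Count positive/negative pairs that are ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores, made up for illustration; they are not results from the paper.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of utterances, which is fine for a toy example; on a full evaluation set, a rank-based implementation from an established library would be the more practical choice.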
Pages: 6